e-Chapter 7 


Neural Networks 


Neural networks are a biologically inspired¹ model which has had considerable
engineering success in applications ranging from time series prediction
to vision. Neural networks are a generalization of the perceptron which uses 
a feature transform that is learned from the data. As such, they are a very 
powerful and flexible model. 

Neural networks are a good candidate model for learning from data because 
they can efficiently approximate complex target functions and they come with 
good algorithms for fitting the data. We begin with the basic properties of 
neural networks, and how to train them on data. We will introduce a variety 
of useful techniques for fitting the data by minimizing the in-sample error. 
Because neural networks are a very flexible model, with great approximation 
power, it is easy to overfit the data; we will study a number of techniques to 
control overfitting specific to neural networks. 


7.1 The Multi-layer Perceptron (MLP) 


The perceptron cannot implement some simple classification functions. To
illustrate, we use the target on the right, which is related to the Boolean
XOR function. In this example, f cannot be written as
$\text{sign}(\mathbf{w}^T\mathbf{x})$. However, f is composed of two linear
parts. Indeed, as we will soon see, we can decompose f into two simple
perceptrons, corresponding to the lines in the figure, and then combine the
outputs of these two perceptrons in a simple way to get back f. The two
perceptrons are shown next.











¹ The analogy with biological neurons, though inspiring, should not be taken too far; after
all, we build planes with wings that do not flap. In much the same way, neural networks,
when applied to learning from data, do not much resemble their biological counterparts.

The target f equals +1 when exactly one of $h_1, h_2$ equals +1. This is the
Boolean XOR function: $f = \text{XOR}(h_1, h_2)$, where +1 represents “TRUE” and −1
represents “FALSE”. We can rewrite f using the simpler OR and AND operations:
$\text{OR}(h_1, h_2) = +1$ if at least one of $h_1, h_2$ equals +1, and
$\text{AND}(h_1, h_2) = +1$ if both $h_1, h_2$ equal +1. Using standard Boolean
notation (multiplication for AND, addition for OR, and overbar for negation),


$$f = h_1 \bar{h}_2 + \bar{h}_1 h_2.$$
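
The decomposition can be checked by brute force over the four combinations of inputs. The following is a minimal sketch; the function name `xor_as_or_of_ands` is made up for this illustration, and ±1 values are mapped to Python booleans only to evaluate the Boolean formula.

```python
from itertools import product

def xor_as_or_of_ands(h1: int, h2: int) -> int:
    """Evaluate f = h1*NOT(h2) + NOT(h1)*h2, with +1 as TRUE and -1 as FALSE."""
    a, b = (h1 == 1), (h2 == 1)
    value = (a and not b) or (not a and b)   # OR of two ANDs
    return 1 if value else -1

# Compare with the definition: f = +1 when exactly one input equals +1.
for h1, h2 in product((-1, 1), repeat=2):
    exactly_one = (h1 == 1) != (h2 == 1)
    assert xor_as_or_of_ands(h1, h2) == (1 if exactly_one else -1)
```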


Exercise 7.1 


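Consider a target function f whose ‘+’ and ‘−’ regions are illustrated below.


[Figure: the ‘+’ and ‘−’ regions of f, formed by the perceptrons $h_1$, $h_2$ and $h_3$.]


Show that

$$f = \bar{h}_1 h_2 h_3 + h_1 \bar{h}_2 h_3 + h_1 h_2 \bar{h}_3.$$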


Is there a systematic way of going from a target which is a decomposition of 
perceptrons to a Boolean formula like this? [Hint: consider only the regions 
of f which are ‘+’ and use the disjunctive normal form (OR of ANDs).] 


Exercise 7.1 shows that a complicated target, which is composed of perceptrons,
is a disjunction of conjunctions (OR of ANDs) applied to the component
perceptrons. This is a useful insight, because OR and AND can be implemented
by the perceptron:


$$\text{OR}(x_1, x_2) = \text{sign}(x_1 + x_2 + 1.5);$$
$$\text{AND}(x_1, x_2) = \text{sign}(x_1 + x_2 - 1.5).$$
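
A quick way to convince yourself of these two formulas is to tabulate them on all ±1 inputs. This sketch assumes the convention sign(0) = −1, which is an arbitrary choice here because the argument is never zero for ±1 inputs.

```python
def sign(s: float) -> int:
    # sign(0) does not occur for +/-1 inputs; returning -1 there is arbitrary.
    return 1 if s > 0 else -1

def OR(x1: int, x2: int) -> int:
    return sign(x1 + x2 + 1.5)

def AND(x1: int, x2: int) -> int:
    return sign(x1 + x2 - 1.5)

# Truth table: (x1, x2) -> (OR, AND)
truth_table = {(x1, x2): (OR(x1, x2), AND(x1, x2))
               for x1 in (-1, 1) for x2 in (-1, 1)}
```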


This implies that these more complicated targets are ultimately just combi- 
nations of perceptrons. To see how to combine the perceptrons to get f, we 
introduce a graph representation of perceptrons, starting with OR and AND: 











[Figure: graph representations of the perceptrons $\text{OR}(x_1, x_2)$ and $\text{AND}(x_1, x_2)$.]





A node outputs a value to an arrow. The weight on an arrow multiplies this 
output and passes the result to the next node. Everything coming to this next 
node is summed and then transformed by sign(·) to get the final output.


Exercise 7.2 


(a) The Boolean OR and AND of two inputs can be extended to more than two
inputs: $\text{OR}(x_1, \ldots, x_M) = +1$ if any one of the M inputs is +1;
$\text{AND}(x_1, \ldots, x_M) = +1$ if all the inputs equal +1. Give graph
representations of $\text{OR}(x_1, \ldots, x_M)$ and $\text{AND}(x_1, \ldots, x_M)$.


(b) Give the graph representation of the perceptron: $h(\mathbf{x}) = \text{sign}(\mathbf{w}^T\mathbf{x})$.


(c) Give the graph representation of $\text{OR}(x_1, \bar{x}_2, x_3)$.


The MLP for a Complex Target. Since $f = h_1 \bar{h}_2 + \bar{h}_1 h_2$, which is an OR
of the two inputs $h_1 \bar{h}_2$ and $\bar{h}_1 h_2$, we first use the OR perceptron, to obtain:

[Figure: the OR perceptron applied to the two inputs $h_1 \bar{h}_2$ and $\bar{h}_1 h_2$.]

The two inputs $h_1 \bar{h}_2$ and $\bar{h}_1 h_2$ are ANDs. As such, they can be simulated by
the output of two AND perceptrons. To deal with negation of the inputs to
the AND, we negate the weights multiplying the negated inputs (as you have
done in Exercise 7.2(c)). The resulting graph representation of f is:

















The blue and red weights are simulating the required two ANDs. Finally, since
$h_1 = \text{sign}(\mathbf{w}_1^T\mathbf{x})$ and $h_2 = \text{sign}(\mathbf{w}_2^T\mathbf{x})$ are perceptrons, we further expand the
$h_1$ and $h_2$ nodes to obtain the graph representation of f.




















The next exercise asks you to compute an explicit algebraic formula for f.
The visual graph representation is much neater and easier to generalize.


Exercise 7.3


Use the graph representation to get an explicit formula for f and show that:


$$f(\mathbf{x}) = \text{sign}\left[\text{sign}\left(h_1(\mathbf{x}) - h_2(\mathbf{x}) - \tfrac{3}{2}\right) - \text{sign}\left(h_1(\mathbf{x}) - h_2(\mathbf{x}) + \tfrac{3}{2}\right) + \tfrac{3}{2}\right],$$


where $h_1(\mathbf{x}) = \text{sign}(\mathbf{w}_1^T\mathbf{x})$ and $h_2(\mathbf{x}) = \text{sign}(\mathbf{w}_2^T\mathbf{x})$.


Let’s compare the graph form of f with the graph form of the simple
perceptron, shown to the right. More layers of nodes are used between the
input and output to implement f, as compared to the simple perceptron, hence
we call it a multi-layer perceptron (MLP). The additional layers are called
hidden layers.













Abu-Mostafa, Magdon-Ismail, Lin: Jan-2015 








[Figure: graph form of the simple perceptron, $\text{sign}(\mathbf{w}^T\mathbf{x})$.]




e-7. NEURAL NETWORKS 7.1. THE MULTI-LAYER PERCEPTRON (MLP) 


Notice that the layers feed forward into the next layer only (there are no back- 
ward pointing arrows and no jumps to other layers). The input (leftmost) 
layer is not counted as a layer, so in this example, there are 3 layers (2 hid- 
den layers with 3 nodes each, and an output layer with 1 node). The simple 
perceptron has no hidden layers, just an input and output. The addition of 
hidden layers is what allowed us to implement the more complicated target. 
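
The layered construction can be traced in code. The sketch below is only an illustration: the weight vectors `W1` and `W2` are hypothetical choices (they make h1 and h2 the signs of the two coordinates), and the function names are not from the text. The hidden layers use the AND weights with negated inputs, and the output layer uses the OR weights.

```python
def sign(s: float) -> int:
    return 1 if s > 0 else -1

def perceptron(w, x) -> int:
    """h(x) = sign(w^T x), with x augmented by the constant x0 = 1."""
    return sign(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

def mlp(x, w1, w2) -> int:
    # Hidden layer 1: the two component perceptrons.
    h1, h2 = perceptron(w1, x), perceptron(w2, x)
    # Hidden layer 2: two ANDs; a negated input gets a negated weight.
    a = sign(h1 - h2 - 1.5)     # h1 AND (NOT h2)
    b = sign(-h1 + h2 - 1.5)    # (NOT h1) AND h2
    # Output layer: OR of the two ANDs.
    return sign(a + b + 1.5)

W1 = (0.0, 1.0, 0.0)   # hypothetical: h1(x) = sign(x1)
W2 = (0.0, 0.0, 1.0)   # hypothetical: h2(x) = sign(x2)
```

With these weights the network outputs +1 exactly when the two coordinates of x have opposite signs, which no single perceptron can do.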


Exercise 7.4 


For the target function in Exercise 7.1, give the MLP in graphical form, as 
well as the explicit algebraic form. 


If f can be decomposed into perceptrons using an OR of ANDs, then it can be 
implemented by a 3-layer perceptron. If f is not strictly decomposable into 
perceptrons, but the decision boundary is smooth, then a 3-layer perceptron 
can come arbitrarily close to implementing f. A ‘proof by picture’ illustration 
for a disc target function follows: 




















[Figure: three panels: Target; 8 perceptrons; 16 perceptrons.]


The formal proof is somewhat analogous to the theorem in calculus which says 
that any continuous function on a compact set can be approximated arbitrarily 
closely using step functions. The perceptron is the analog of the step function. 
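
The picture can be imitated numerically. The sketch below is one possible construction, not necessarily the one used for the figure: it assumes a unit disc target and approximates it by the AND of k perceptrons whose boundaries are tangent lines to the disc, then measures the fraction of grid points where the two disagree. The function names and the grid over [−1.5, 1.5]² are assumptions made for this illustration.

```python
import math

def and_of_perceptrons(x: float, y: float, k: int) -> int:
    """AND of k perceptrons; perceptron i is sign(1 - u_i . (x, y)), u_i a unit vector."""
    for i in range(k):
        angle = 2.0 * math.pi * i / k
        if 1.0 - (math.cos(angle) * x + math.sin(angle) * y) <= 0:
            return -1
    return 1

def disc(x: float, y: float) -> int:
    return 1 if x * x + y * y < 1.0 else -1

def disagreement(k: int, n: int = 100) -> float:
    """Fraction of points on an n-by-n grid where the approximation differs from the disc."""
    pts = [-1.5 + 3.0 * (i + 0.5) / n for i in range(n)]
    bad = sum(and_of_perceptrons(x, y, k) != disc(x, y) for x in pts for y in pts)
    return bad / (n * n)
```

Calling `disagreement(8)` and `disagreement(16)` shows the mismatch shrinking as more perceptrons are used.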

We have thus found a generalization of the simple perceptron that looks 
much like the simple perceptron itself, except for the addition of more layers. 
We gained the ability to model more complex target functions by adding more 
nodes (hidden units) in the hidden layers — this corresponds to allowing more 
perceptrons in the decomposition of f. In fact, a suitably large 3-layer MLP 
can closely approximate just about any target function, and fit any data set, 
so it is a very powerful learning model. Use it with care. If your MLP is too 
large you may lose generalization ability. 

Once you fix the size of the MLP (number of hidden layers and number 
of hidden units in each layer), you learn the weights on every link (arrow) by 
fitting the data. Let’s consider the simple perceptron, 


$$h(\mathbf{x}) = \theta(\mathbf{w}^T\mathbf{x}).$$

When $\theta(s) = \text{sign}(s)$, learning the weights was already a hard combinatorial
problem and had a variety of algorithms, including the pocket algorithm, for
fitting data (Chapter 3). The combinatorial optimization problem is even
harder with the MLP, for the same reason, namely that the sign(·) function is
not smooth; a smooth, differentiable approximation to sign(·) will allow us to
use analytic methods, rather than purely combinatorial methods, to find the
optimal weights. We therefore approximate, or ‘soften’, the sign(·) function by
using the tanh(·) function. The MLP is sometimes called a (hard) threshold
neural network because the transformation function is a hard threshold at zero.
Here, we choose $\theta(x) = \tanh(x)$, which is in between linear and the hard
threshold: nearly linear for $x \approx 0$ and nearly ±1 for |x| large. The
tanh(·) function is another example of a sigmoid (because its shape looks like
a flattened out ‘s’), related to the sigmoid we used for logistic regression.²
Such networks are called sigmoidal neural networks. Just as we could use the
weights learned from linear regression for classification, we could use weights
learned using the sigmoidal neural network with tanh(·) activation function for
classification by replacing the output activation function with sign(·).
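
The two regimes of tanh are easy to see numerically. This is a small sketch; the helper `soft_sign` and its scaling parameter `alpha` are names introduced here for illustration.

```python
import math

def theta(s: float) -> float:
    """Soft threshold used in place of sign."""
    return math.tanh(s)

near_zero = [theta(s) for s in (-0.01, 0.01)]   # almost equal to s itself
saturated = [theta(s) for s in (-5.0, 5.0)]     # almost -1 and +1

def soft_sign(x: float, alpha: float) -> float:
    """tanh(alpha * x): approaches sign(x) as alpha grows, for x != 0."""
    return math.tanh(alpha * x)
```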











Exercise 7.5 


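Given $\mathbf{w}_1$ and $\epsilon > 0$, find $\mathbf{w}_2$ such that $|\text{sign}(\mathbf{w}_1^T\mathbf{x}_n) - \tanh(\mathbf{w}_2^T\mathbf{x}_n)| < \epsilon$
for $\mathbf{x}_n \in \mathcal{D}$. [Hint: For large enough $\alpha$, $\text{sign}(x) \approx \tanh(\alpha x)$.]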


The previous example shows that the sign(·) function can be closely
approximated by the tanh(·) function. A concrete illustration of this is shown
in the figure to the right. The figure shows how the in-sample error
$E_{\text{in}}$ varies with one of the weights in w on an example problem for
the perceptron (blue curve) as compared to the sigmoidal version (red curve).
The sigmoidal approximation captures the general shape of the error, so that
if we minimize the sigmoidal in-sample error, we get a good approximation to
minimizing the in-sample classification error.

[Figure: $E_{\text{in}}$ versus a weight w, for sign (blue curve) and tanh (red curve).]



² In logistic regression, we used the sigmoid because we wanted a probability as the output.
Here, we use the ‘soft’ tanh(·) because we want a friendly objective function to optimize.


7.2 Neural Networks


The neural network is our ‘softened’ MLP. Let’s begin with a graph represen- 
tation of a feed-forward neural network (the only kind we will consider). 











[Figure: a feed-forward neural network with input layer $\ell = 0$, hidden layers $0 < \ell < L$, and output layer $\ell = L$.]


The graph representation depicts a function in our hypothesis set. While this 
graphical view is aesthetic and intuitive, with information ‘flowing’ from the 
inputs on the far left, along links and through hidden nodes, ultimately to the 
output h(x) on the far right, it will be necessary to algorithmically describe 
the function being computed. Things are going to get messy, and this calls for 
a very systematic notation; bear with us. 


7.2.1 Notation 


There are layers labeled by $\ell = 0, 1, 2, \ldots, L$. In our example above, L = 3, i.e.
we have three layers (the input layer $\ell = 0$ is usually not considered a layer
and is meant for feeding in the inputs). The layer $\ell = L$ is the output layer,
which determines the value of the function. The layers in between, $0 < \ell < L$,
are the hidden layers. We will use superscript $(\ell)$ to refer to a particular layer.
Each layer $\ell$ has ‘dimension’ $d^{(\ell)}$, which means that it has $d^{(\ell)} + 1$ nodes,
labeled $0, 1, \ldots, d^{(\ell)}$. Every layer has one special node, which is called the bias
node (labeled 0). This bias node is set to have an output 1, which is analogous
to the fictitious $x_0 = 1$ convention that we had for linear models.

Every arrow represents a weight or connection strength from a node in a
layer to a node in the next higher layer. Notice that the bias nodes have no
incoming weights. There are no other connection weights.³ A node with an
incoming weight indicates that some signal is fed into this node. Every such
node with an input has a transformation function $\theta$. If $\theta(s) = \text{sign}(s)$, then we
have the MLP for classification. As we mentioned before, we will be using a soft
version of the MLP with $\theta(x) = \tanh(x)$ to approximate the sign(·) function.
The tanh(·) is a soft threshold or sigmoid, and we already saw a related sigmoid
when we discussed logistic regression in Chapter 3. Ultimately, when we do
classification, we replace the output sigmoid by the hard threshold sign(·).
As a comment, if we were doing regression instead, our entire discussion goes
through with the output transformation being replaced by the identity function
(no transformation) so that the output is a real number. If we were doing
logistic regression, we would replace the output tanh(·) sigmoid by the logistic
regression sigmoid.

³ In a more general setting, weights can connect any two nodes, in addition to going
backward (i.e., one can have cycles). Such networks are called recurrent neural networks,
and we do not consider them here.


The neural network model $\mathcal{H}_{\text{nn}}$ is specified once you determine the
architecture of the neural network, that is the dimension of each layer
$\mathbf{d} = [d^{(0)}, d^{(1)}, \ldots, d^{(L)}]$ (L is the number of layers). A hypothesis
$h \in \mathcal{H}_{\text{nn}}$ is specified by selecting weights for the links. Let’s zoom into a
node in hidden layer $\ell$, to see what weights need to be specified.





[Figure: a node in layer $\ell$ with incoming links from layer $(\ell - 1)$.]


A node has an incoming signal s and an output x. The weights on links into
the node from the previous layer are $w^{(\ell)}$, so the weights are indexed by the
layer into which they go. Thus, the output of the nodes in layer $\ell - 1$ is
multiplied by weights $w^{(\ell)}$. We use subscripts to index the nodes in a layer.
So, $w_{ij}^{(\ell)}$ is the weight into node j in layer $\ell$ from node i in the previous layer,
the signal going into node j in layer $\ell$ is $s_j^{(\ell)}$, and the output of this node
is $x_j^{(\ell)}$. There are some special nodes in the network. The zero nodes in every
layer are constant nodes, set to output 1. They have no incoming weight, but
they have an outgoing weight. The nodes in the input layer $\ell = 0$ are for the
input values, and have no incoming weight or transformation function.


For the most part, we only need to deal with the network on a layer by layer
basis, so we introduce vector and matrix notation for that. We collect all the
input signals to nodes $1, \ldots, d^{(\ell)}$ in layer $\ell$ in the vector $\mathbf{s}^{(\ell)}$. Similarly, collect
the output from nodes $0, \ldots, d^{(\ell)}$ in the vector $\mathbf{x}^{(\ell)}$; note that $\mathbf{x}^{(\ell)} \in \{1\} \times \mathbb{R}^{d^{(\ell)}}$
because of the bias node 0. There are links connecting the outputs of all
nodes in the previous layer to the inputs of layer $\ell$. So, into layer $\ell$, we have
a $(d^{(\ell-1)} + 1) \times d^{(\ell)}$ matrix of weights $\mathrm{W}^{(\ell)}$. The (i, j)-entry of $\mathrm{W}^{(\ell)}$ is $w_{ij}^{(\ell)}$,
going from node i in the previous layer to node j in layer $\ell$.

[Figure: layers $(\ell - 1)$, $\ell$ and $(\ell + 1)$, with weights $\mathrm{W}^{(\ell)}$ into layer $\ell$ and $\mathrm{W}^{(\ell+1)}$ out of it.]


layer $\ell$ parameters

signals in  | $\mathbf{s}^{(\ell)}$   | $d^{(\ell)}$ dimensional input vector
outputs     | $\mathbf{x}^{(\ell)}$   | $d^{(\ell)} + 1$ dimensional output vector
weights in  | $\mathrm{W}^{(\ell)}$   | $(d^{(\ell-1)} + 1) \times d^{(\ell)}$ dimensional matrix
weights out | $\mathrm{W}^{(\ell+1)}$ | $(d^{(\ell)} + 1) \times d^{(\ell+1)}$ dimensional matrix


After you fix the weights $\mathrm{W}^{(\ell)}$ for $\ell = 1, \ldots, L$, you have specified a particular
neural network hypothesis $h \in \mathcal{H}_{\text{nn}}$. We collect all these weight matrices into
a single weight parameter $\mathbf{w} = \{\mathrm{W}^{(1)}, \mathrm{W}^{(2)}, \ldots, \mathrm{W}^{(L)}\}$, and sometimes we will
write $h(\mathbf{x}; \mathbf{w})$ to explicitly indicate the dependence of the hypothesis on w.
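
The bookkeeping above fits in a few lines of code. In this sketch the architecture `d = [2, 3, 2, 1]` is a hypothetical example with L = 3, and the function names are made up here; each weight matrix has one row per node of the previous layer (including its bias node) and one column per non-bias node of the current layer.

```python
def weight_shapes(d):
    """d = [d0, d1, ..., dL]; W^(l) has shape (d^(l-1) + 1, d^(l))."""
    return [(d[l - 1] + 1, d[l]) for l in range(1, len(d))]

def num_weights(d):
    """Total number of weights over all layers."""
    return sum(rows * cols for rows, cols in weight_shapes(d))

d = [2, 3, 2, 1]              # hypothetical architecture, L = 3
shapes = weight_shapes(d)     # one (rows, cols) pair per layer 1..L
```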


7.2.2 Forward Propagation 


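The neural network hypothesis h(x) is computed by the forward propagation
algorithm. First observe that the inputs and outputs of a layer are related by
the transformation function,


$$\mathbf{x}^{(\ell)} = \begin{bmatrix} 1 \\ \theta(\mathbf{s}^{(\ell)}) \end{bmatrix}, \tag{7.1}$$


where $\theta(\mathbf{s}^{(\ell)})$ is a vector whose components are $\theta(s_j^{(\ell)})$. To get the input vector
into layer $\ell$, we compute the weighted sum of the outputs from the previous
layer, with weights specified in $\mathrm{W}^{(\ell)}$: $s_j^{(\ell)} = \sum_{i=0}^{d^{(\ell-1)}} w_{ij}^{(\ell)} x_i^{(\ell-1)}$. This process
is compactly represented by the matrix equation


$$\mathbf{s}^{(\ell)} = (\mathrm{W}^{(\ell)})^T \mathbf{x}^{(\ell-1)}. \tag{7.2}$$


All that remains is to initialize the input layer to $\mathbf{x}^{(0)} = \mathbf{x}$ (so $d^{(0)} = d$, the
input dimension)⁴ and use Equations (7.2) and (7.1) in the following chain,


$$\mathbf{x} = \mathbf{x}^{(0)} \xrightarrow{\mathrm{W}^{(1)}} \mathbf{s}^{(1)} \xrightarrow{\theta} \mathbf{x}^{(1)} \xrightarrow{\mathrm{W}^{(2)}} \mathbf{s}^{(2)} \xrightarrow{\theta} \mathbf{x}^{(2)} \cdots \longrightarrow \mathbf{s}^{(L)} \xrightarrow{\theta} \mathbf{x}^{(L)} = h(\mathbf{x}).$$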


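⁴ Recall that the input vectors are also augmented with $x_0 = 1$.

Forward propagation to compute h(x):
1: $\mathbf{x}^{(0)} \leftarrow \mathbf{x}$   [Initialization]
2: for $\ell = 1$ to L do   [Forward Propagation]
3:    $\mathbf{s}^{(\ell)} \leftarrow (\mathrm{W}^{(\ell)})^T \mathbf{x}^{(\ell-1)}$
4:    $\mathbf{x}^{(\ell)} \leftarrow \begin{bmatrix} 1 \\ \theta(\mathbf{s}^{(\ell)}) \end{bmatrix}$
5: $h(\mathbf{x}) = \mathbf{x}^{(L)}$


After forward propagation, the output vector $\mathbf{x}^{(\ell)}$ at every layer $\ell = 0, \ldots, L$
has been computed.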
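
Forward propagation translates almost line by line into code. The following is a minimal pure-Python sketch, assuming tanh at every node; the weights and the input are illustrative values for a small network with L = 3, and the representation of each matrix as a list of rows is a choice made here.

```python
import math

def forward(x, weights, theta=math.tanh):
    """Return the layer outputs x^(0..L) and the signals s^(1..L).

    weights[l-1] holds W^(l) as a list of rows: entry [i][j] is the weight from
    node i of layer l-1 (i = 0 is the bias node) to non-bias node j+1 of layer l.
    """
    xs = [[1.0] + list(x)]                # x^(0): the input augmented with 1
    ss = []
    L = len(weights)
    for l, W in enumerate(weights, start=1):
        prev = xs[-1]
        # s^(l) = (W^(l))^T x^(l-1)
        s = [sum(W[i][j] * prev[i] for i in range(len(prev)))
             for j in range(len(W[0]))]
        out = [theta(v) for v in s]
        # Hidden layers get a bias node; the output layer is kept as is.
        xs.append(out if l == L else [1.0] + out)
        ss.append(s)
    return xs, ss

# Illustrative weights: one input, two hidden layers (2 units, then 1 unit).
W1 = [[0.1, 0.2], [0.3, 0.4]]
W2 = [[0.2], [1.0], [-3.0]]
W3 = [[1.0], [2.0]]
xs, ss = forward([2.0], [W1, W2, W3])
h = xs[-1][0]                             # the network output h(x)
```

Both `xs` and `ss` are kept because backpropagation reuses the saved layer outputs.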
Exercise 7.6


Let V and Q be the number of nodes and weights in the neural network,


$$V = \sum_{\ell=0}^{L} d^{(\ell)}, \qquad Q = \sum_{\ell=1}^{L} d^{(\ell)} \left(d^{(\ell-1)} + 1\right).$$


In terms of V and Q, how many computations are made in forward propagation
(additions, multiplications and evaluations of $\theta$)?


[Answer: O(Q) multiplications and additions, and O(V) $\theta$-evaluations.]


If we want to compute $E_{\text{in}}$, all we need is $h(\mathbf{x}_n)$ and $y_n$. For the sum of
squares,


$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \left(h(\mathbf{x}_n; \mathbf{w}) - y_n\right)^2 = \frac{1}{N} \sum_{n=1}^{N} \left(\mathbf{x}_n^{(L)} - y_n\right)^2.$$

We now discuss how to minimize $E_{\text{in}}$ to obtain the learned weights. It will be a
direct application of gradient descent, with a special algorithm that computes
the gradient efficiently.


7.2.3 Backpropagation Algorithm 


We studied an algorithm for getting to a local minimum of a smooth in-sample 
error surface in Chapter 3, namely gradient descent: initialize the weights to 
w(0) and for t = 1,2,... update the weights by taking a step in the negative 
gradient direction, 


$$\mathbf{w}(t + 1) = \mathbf{w}(t) - \eta \nabla E_{\text{in}}(\mathbf{w}(t)).$$


We called this (batch) gradient descent. To implement gradient descent, we
need the gradient. 


Exercise 7.7 


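For the sigmoidal perceptron, $h(\mathbf{x}) = \tanh(\mathbf{w}^T\mathbf{x})$, let the in-sample error
be $E_{\text{in}}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} (\tanh(\mathbf{w}^T\mathbf{x}_n) - y_n)^2$. Show that


$$\nabla E_{\text{in}}(\mathbf{w}) = \frac{2}{N} \sum_{n=1}^{N} \left(\tanh(\mathbf{w}^T\mathbf{x}_n) - y_n\right)\left(1 - \tanh^2(\mathbf{w}^T\mathbf{x}_n)\right)\mathbf{x}_n.$$


If $\mathbf{w} \to \infty$, what happens to the gradient; how is this related to why it is
hard to optimize the perceptron?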


We now consider the sigmoidal multi-layer neural network with $\theta(x) = \tanh(x)$.
Since h(x) is smooth, we can apply gradient descent to the resulting error
function. To do so, we need the gradient $\nabla E_{\text{in}}(\mathbf{w})$. Recall that the weight vector w
contains all the weight matrices $\mathrm{W}^{(1)}, \ldots, \mathrm{W}^{(L)}$, and we need the derivatives
with respect to all these weights. Unlike the sigmoidal perceptron in
Exercise 7.7, for the multilayer sigmoidal network there is no simple closed form
expression for the gradient. Consider an in-sample error which is the sum of
the point-wise errors over the data points (as is the squared in-sample error),


$$E_{\text{in}}(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} e_n,$$


where $e_n = e(h(\mathbf{x}_n), y_n)$. For the squared error, $e(h, y) = (h - y)^2$. To
compute the gradient of $E_{\text{in}}$, we need its derivative with respect to each
weight matrix:


$$\frac{\partial E_{\text{in}}}{\partial \mathrm{W}^{(\ell)}} = \frac{1}{N} \sum_{n=1}^{N} \frac{\partial e_n}{\partial \mathrm{W}^{(\ell)}}. \tag{7.3}$$





The basic building block in (7.3) is the partial derivative of the error on a
data point e, with respect to the $\mathrm{W}^{(\ell)}$. A quick and dirty way to get
$\partial e / \partial \mathrm{W}^{(\ell)}$ is to use the numerical finite difference approach. The
complexity of obtaining the partial derivatives with respect to every weight is
$O(Q^2)$, where Q is the number of weights (see Problem 7.6). From (7.3), we have
to compute these derivatives for every data point, so the numerical approach is
computationally prohibitive. We now derive an elegant dynamic programming⁵
algorithm known as backpropagation. Backpropagation allows us to compute the
partial derivatives with respect to every weight efficiently, using O(Q)
computation. We describe backpropagation for getting the partial derivative of
the error e, but the algorithm is general and can be used to get the partial
derivative of any function of the output h(x) with respect to the weights.
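
To see where the quadratic cost of the numerical approach comes from, consider the sketch below. It uses a single sigmoidal unit rather than a full network, purely to keep the example short; the values of `w`, `x` and `y` are hypothetical. Each of the Q weights needs its own perturbed error evaluations, and each evaluation already costs on the order of Q operations.

```python
import math

def error(w, x, y):
    """Squared error e = (tanh(w^T x) - y)^2 for a single sigmoidal unit."""
    h = math.tanh(sum(wi * xi for wi, xi in zip(w, x)))
    return (h - y) ** 2

def numerical_gradient(w, x, y, eps=1e-6):
    """Central finite differences: two error evaluations per weight."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        up[i] += eps
        down = list(w)
        down[i] -= eps
        grad.append((error(up, x, y) - error(down, x, y)) / (2 * eps))
    return grad

def analytic_gradient(w, x, y):
    """Chain rule: de/dw = 2 (h - y) (1 - h^2) x."""
    h = math.tanh(sum(wi * xi for wi, xi in zip(w, x)))
    return [2 * (h - y) * (1 - h * h) * xi for xi in x]

w, x, y = [0.1, -0.2, 0.3], [1.0, 0.5, -1.5], 1.0   # hypothetical values
```

The finite-difference gradient is still handy as an independent check on an analytic or backpropagated gradient.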

Backpropagation is based on several applications of the chain rule to write
partial derivatives in layer $\ell$ using partial derivatives in layer $(\ell + 1)$. To
describe the algorithm, we define the sensitivity vector for layer $\ell$, which is
the sensitivity (gradient) of the error e with respect to the input signal $\mathbf{s}^{(\ell)}$
that goes into layer $\ell$. We denote the sensitivity by $\boldsymbol{\delta}^{(\ell)}$,


$$\boldsymbol{\delta}^{(\ell)} = \frac{\partial e}{\partial \mathbf{s}^{(\ell)}}.$$


The sensitivity quantifies how e changes with $\mathbf{s}^{(\ell)}$. Using the sensitivity, we
can write the partial derivatives with respect to the weights $\mathrm{W}^{(\ell)}$ as


$$\frac{\partial e}{\partial \mathrm{W}^{(\ell)}} = \mathbf{x}^{(\ell-1)} (\boldsymbol{\delta}^{(\ell)})^T. \tag{7.4}$$


We will derive this formula later, but for now let’s examine it closely. The
partial derivatives on the left form a matrix with dimensions $(d^{(\ell-1)} + 1) \times d^{(\ell)}$
and the ‘outer product’ of the two vectors on the right gives exactly such a
matrix. The partial derivatives have contributions from two components. (i) The
output vector of the layer from which the weights originate; the larger the
output, the more sensitive e is to the weights in the layer. (ii) The sensitivity
vector of the layer into which the weights go; the larger the sensitivity vector,
the more sensitive e is to the weights in that layer.

The outputs $\mathbf{x}^{(\ell)}$ for every layer $\ell \ge 0$ can be computed by a forward
propagation. So to get the partial derivatives, it suffices to obtain the
sensitivity vectors $\boldsymbol{\delta}^{(\ell)}$ for every layer $\ell \ge 1$ (remember that there is no input
signal to layer $\ell = 0$). It turns out that the sensitivity vectors can be obtained
by running a slightly modified version of the neural network backwards, and
hence the name backpropagation. In forward propagation, each layer outputs
the vector $\mathbf{x}^{(\ell)}$ and in backpropagation, each layer outputs (backwards) the
vector $\boldsymbol{\delta}^{(\ell)}$. In forward propagation, we compute $\mathbf{x}^{(\ell)}$ from $\mathbf{x}^{(\ell-1)}$ and in
backpropagation, we compute $\boldsymbol{\delta}^{(\ell)}$ from $\boldsymbol{\delta}^{(\ell+1)}$. The basic idea is illustrated in the
following figure.


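⁵ Dynamic programming is an elegant algorithmic technique in which one builds up a
solution to a complex problem using the solutions to related but simpler problems.

[Figure: backpropagation from layer $(\ell + 1)$ to layer $\ell$: $\boldsymbol{\delta}^{(\ell+1)}$ is multiplied by the weights $\mathrm{W}^{(\ell+1)}$, summed, and multiplied by $\theta'(\mathbf{s}^{(\ell)})$ to give $\boldsymbol{\delta}^{(\ell)}$.]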


As you can see in the figure, the neural network is slightly modified only in
that we have changed the transformation function for the nodes. In forward
propagation, the transformation was the sigmoid $\theta(\cdot)$. In backpropagation,
the transformation is multiplication by $\theta'(s^{(\ell)})$, where $s^{(\ell)}$ is the input to the
node. So the transformation function is now different for each node, and
it depends on the input to the node, which depends on x. This input was
computed in the forward propagation. For the tanh(·) transformation function,
$\tanh'(\mathbf{s}^{(\ell)}) = 1 - \tanh^2(\mathbf{s}^{(\ell)}) = 1 - \mathbf{x}^{(\ell)} \otimes \mathbf{x}^{(\ell)}$, where $\otimes$ denotes component-wise
multiplication.


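In the figure, layer $(\ell + 1)$ outputs (backwards) the sensitivity vector $\boldsymbol{\delta}^{(\ell+1)}$,
which gets multiplied by the weights $\mathrm{W}^{(\ell+1)}$, summed and passed into the
nodes in layer $\ell$. Nodes in layer $\ell$ multiply by $\theta'(\mathbf{s}^{(\ell)})$ to get $\boldsymbol{\delta}^{(\ell)}$. Using $\otimes$, a
shorthand notation for this backpropagation step is:


$$\boldsymbol{\delta}^{(\ell)} = \theta'(\mathbf{s}^{(\ell)}) \otimes \left[\mathrm{W}^{(\ell+1)} \boldsymbol{\delta}^{(\ell+1)}\right]_1^{d^{(\ell)}}, \tag{7.5}$$


where the vector $\left[\mathrm{W}^{(\ell+1)} \boldsymbol{\delta}^{(\ell+1)}\right]_1^{d^{(\ell)}}$ contains components $1, \ldots, d^{(\ell)}$ of the
vector $\mathrm{W}^{(\ell+1)} \boldsymbol{\delta}^{(\ell+1)}$ (excluding the bias component which has index 0). This
formula is not surprising. The sensitivity of e to inputs of layer $\ell$ is proportional
to the slope of the activation function in layer $\ell$ (bigger slope means a small
change in $\mathbf{s}^{(\ell)}$ will have a larger effect on $\mathbf{x}^{(\ell)}$), the size of the weights going
out of the layer (bigger weights mean a small change in $\mathbf{s}^{(\ell)}$ will have more
impact on $\mathbf{s}^{(\ell+1)}$) and the sensitivity in the next layer (a change in layer $\ell$
affects the inputs to layer $\ell + 1$, so if e is more sensitive to layer $\ell + 1$, then it
will also be more sensitive to layer $\ell$).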


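We will derive this backward recursion later. For now, observe that if we
know $\boldsymbol{\delta}^{(\ell+1)}$, then we can get $\boldsymbol{\delta}^{(\ell)}$. We use $\boldsymbol{\delta}^{(L)}$ to seed the backward process,
and we can get that explicitly because $e = (\mathbf{x}^{(L)} - y)^2 = (\theta(\mathbf{s}^{(L)}) - y)^2$.
Therefore,

$$\boldsymbol{\delta}^{(L)} = \frac{\partial e}{\partial \mathbf{s}^{(L)}} = \frac{\partial}{\partial \mathbf{s}^{(L)}} (\mathbf{x}^{(L)} - y)^2 = 2(\mathbf{x}^{(L)} - y) \frac{\partial \mathbf{x}^{(L)}}{\partial \mathbf{s}^{(L)}} = 2(\mathbf{x}^{(L)} - y)\, \theta'(\mathbf{s}^{(L)}).$$


When the output transformation is tanh(·), $\theta'(\mathbf{s}^{(L)}) = 1 - (\mathbf{x}^{(L)})^2$ (classification);
when the output transformation is the identity (regression), $\theta'(\mathbf{s}^{(L)}) = 1$.
Now, using (7.5), we can compute all the sensitivities:


$$\boldsymbol{\delta}^{(1)} \longleftarrow \boldsymbol{\delta}^{(2)} \cdots \longleftarrow \boldsymbol{\delta}^{(L-1)} \longleftarrow \boldsymbol{\delta}^{(L)}.$$


Note that since there is only one output node, $\mathbf{s}^{(L)}$ is a scalar, and so too is
$\boldsymbol{\delta}^{(L)}$. The algorithm box below summarizes backpropagation.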


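Backpropagation to compute sensitivities $\boldsymbol{\delta}^{(\ell)}$.
Input: a data point (x, y).
0: Run forward propagation on x to compute and save:
      $\mathbf{s}^{(\ell)}$ for $\ell = 1, \ldots, L$;
      $\mathbf{x}^{(\ell)}$ for $\ell = 0, \ldots, L$.
1: $\boldsymbol{\delta}^{(L)} \leftarrow 2(\mathbf{x}^{(L)} - y)\, \theta'(\mathbf{s}^{(L)})$   [Initialization]
      $\theta'(\mathbf{s}^{(L)}) = 1 - (\mathbf{x}^{(L)})^2$ if $\theta(s) = \tanh(s)$; $\theta'(\mathbf{s}^{(L)}) = 1$ if $\theta(s) = s$.
2: for $\ell = L - 1$ to 1 do   [Back-Propagation]
3:    Let $\theta'(\mathbf{s}^{(\ell)}) = \left[1 - \mathbf{x}^{(\ell)} \otimes \mathbf{x}^{(\ell)}\right]_1^{d^{(\ell)}}$.
4:    Compute the sensitivity $\boldsymbol{\delta}^{(\ell)}$ from $\boldsymbol{\delta}^{(\ell+1)}$:
         $\boldsymbol{\delta}^{(\ell)} \leftarrow \theta'(\mathbf{s}^{(\ell)}) \otimes \left[\mathrm{W}^{(\ell+1)} \boldsymbol{\delta}^{(\ell+1)}\right]_1^{d^{(\ell)}}$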


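In step 3, we assumed tanh-hidden node transformations. If the hidden unit
transformation functions are not tanh(·), then the derivative in step 3 should
be updated accordingly. Using forward propagation, we compute $\mathbf{x}^{(\ell)}$ for
$\ell = 0, \ldots, L$ and using backpropagation, we compute $\boldsymbol{\delta}^{(\ell)}$ for $\ell = 1, \ldots, L$.
Finally, we get the partial derivative of the error on a single data point using
Equation (7.4). Nothing illuminates the moving parts better than working an
example from start to finish.

Example 7.1. Consider the following neural network.

[Figure: the network, with forward propagation and backpropagation worked out for the data point x = 2, y = 1.]

$$\mathrm{W}^{(1)} = \begin{bmatrix} 0.1 & 0.2 \\ 0.3 & 0.4 \end{bmatrix}; \qquad \mathrm{W}^{(2)} = \begin{bmatrix} 0.2 \\ 1 \\ -3 \end{bmatrix}; \qquad \mathrm{W}^{(3)} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}.$$

We have explicitly shown how $\boldsymbol{\delta}^{(2)}$ is obtained from $\boldsymbol{\delta}^{(3)}$. It is now a simple
matter to combine the output vectors $\mathbf{x}^{(\ell)}$ with the sensitivity vectors $\boldsymbol{\delta}^{(\ell)}$
using (7.4) to obtain the partial derivatives that are needed for the gradient:


$$\frac{\partial e}{\partial \mathrm{W}^{(1)}} = \mathbf{x}^{(0)} (\boldsymbol{\delta}^{(1)})^T = \begin{bmatrix} -0.44 & 0.88 \\ -0.88 & 1.75 \end{bmatrix}; \qquad \frac{\partial e}{\partial \mathrm{W}^{(2)}} = \begin{bmatrix} -0.69 \\ -0.42 \\ -0.53 \end{bmatrix}; \qquad \frac{\partial e}{\partial \mathrm{W}^{(3)}} = \begin{bmatrix} -1.85 \\ 1.67 \end{bmatrix}.$$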
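
The whole pipeline (forward propagation, the backward recursion (7.5), and the outer products (7.4)) fits in a short pure-Python sketch. The weight matrices and the data point (x, y) = (2, 1) below are assumptions made for this illustration; they were chosen because they reproduce, to two decimals, the partial derivatives quoted in this example. All nodes, including the output, are assumed to use tanh.

```python
import math

def forward(x, weights):
    """Layer outputs x^(0..L); hidden layers carry a leading bias entry of 1."""
    xs = [[1.0] + list(x)]
    for l, W in enumerate(weights, start=1):
        prev = xs[-1]
        out = [math.tanh(sum(W[i][j] * prev[i] for i in range(len(prev))))
               for j in range(len(W[0]))]
        xs.append(out if l == len(weights) else [1.0] + out)
    return xs

def backprop(x, y, weights):
    """Return the matrices de/dW^(l), l = 1..L, for e = (h(x) - y)^2."""
    L = len(weights)
    xs = forward(x, weights)
    h = xs[L][0]
    delta = {L: [2.0 * (h - y) * (1.0 - h * h)]}       # seed: delta^(L)
    for l in range(L - 1, 0, -1):
        W_next, d_next = weights[l], delta[l + 1]      # W^(l+1), delta^(l+1)
        delta[l] = [
            (1.0 - xs[l][j] ** 2)                      # theta'(s_j) for tanh
            * sum(W_next[j][k] * d_next[k] for k in range(len(d_next)))
            for j in range(1, len(xs[l]))              # skip the bias index 0
        ]
    # Equation (7.4): outer product x^(l-1) (delta^(l))^T for each layer.
    return [[[xi * dj for dj in delta[l]] for xi in xs[l - 1]]
            for l in range(1, L + 1)]

# Assumed weights and data point (see the lead-in above).
W1 = [[0.1, 0.2], [0.3, 0.4]]
W2 = [[0.2], [1.0], [-3.0]]
W3 = [[1.0], [2.0]]
grads = backprop([2.0], 1.0, [W1, W2, W3])
```

Here `grads[0]`, `grads[1]` and `grads[2]` hold the partial derivatives with respect to the first, second and third weight matrices.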

















Exercise 7.8


Repeat the computations in Example 7.1 for the case when the output
transformation is the identity. You should compute $\mathbf{s}^{(\ell)}$, $\mathbf{x}^{(\ell)}$, $\boldsymbol{\delta}^{(\ell)}$ and $\partial e / \partial \mathrm{W}^{(\ell)}$.


Let’s derive (7.4) and (7.5), which are the core equations of backpropagation.
There’s nothing to it but repeated application of the chain rule. If you wish
to trust our math, you won’t miss much by moving on.


Begin safe skip: If you trust our math, you can skip this part without
compromising the logical sequence. A similar green box will tell you when to
rejoin.


To begin, let’s take a closer look at the partial derivative, $\partial e / \partial \mathrm{W}^{(\ell)}$. The
situation is illustrated in Figure 7.1.


Figure 7.1: Chain of dependencies from $\mathrm{W}^{(\ell)}$ to $\mathbf{x}^{(L)}$.


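We can identify the following chain of dependencies by which $\mathrm{W}^{(\ell)}$ influences
the output $\mathbf{x}^{(L)}$, and hence the error e.


$$\mathrm{W}^{(\ell)} \longrightarrow \mathbf{s}^{(\ell)} \longrightarrow \mathbf{x}^{(\ell)} \longrightarrow \mathbf{s}^{(\ell+1)} \cdots \longrightarrow \mathbf{x}^{(L)} = h.$$


To derive (7.4), we drill down to a single weight and use the chain rule. For a
single weight $w_{ij}^{(\ell)}$, a change in $w_{ij}^{(\ell)}$ only affects $s_j^{(\ell)}$ and so by the chain rule,


$$\frac{\partial e}{\partial w_{ij}^{(\ell)}} = \frac{\partial s_j^{(\ell)}}{\partial w_{ij}^{(\ell)}} \cdot \frac{\partial e}{\partial s_j^{(\ell)}} = x_i^{(\ell-1)} \cdot \delta_j^{(\ell)},$$


where the last equality follows because $s_j^{(\ell)} = \sum_{\alpha=0}^{d^{(\ell-1)}} w_{\alpha j}^{(\ell)} x_\alpha^{(\ell-1)}$ and by
definition of $\delta_j^{(\ell)}$. We have derived the component form of (7.4).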


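We now derive the component form of (7.5). Since e depends on $\mathbf{s}^{(\ell)}$ only
through $\mathbf{x}^{(\ell)}$ (see Figure 7.1), by the chain rule, we have:


$$\delta_j^{(\ell)} = \frac{\partial e}{\partial s_j^{(\ell)}} = \frac{\partial e}{\partial x_j^{(\ell)}} \cdot \frac{\partial x_j^{(\ell)}}{\partial s_j^{(\ell)}} = \theta'\!\left(s_j^{(\ell)}\right) \cdot \frac{\partial e}{\partial x_j^{(\ell)}}.$$

To get the partial derivative $\partial e / \partial \mathbf{x}^{(\ell)}$, we need to understand how e changes
due to changes in $\mathbf{x}^{(\ell)}$. Again, from Figure 7.1, a change in $\mathbf{x}^{(\ell)}$ only affects
$\mathbf{s}^{(\ell+1)}$ and hence e. Because a particular component of $\mathbf{x}^{(\ell)}$ can affect every
component of $\mathbf{s}^{(\ell+1)}$, we need to sum these dependencies using the chain rule:


$$\frac{\partial e}{\partial x_j^{(\ell)}} = \sum_{k=1}^{d^{(\ell+1)}} \frac{\partial s_k^{(\ell+1)}}{\partial x_j^{(\ell)}} \cdot \frac{\partial e}{\partial s_k^{(\ell+1)}} = \sum_{k=1}^{d^{(\ell+1)}} w_{jk}^{(\ell+1)} \delta_k^{(\ell+1)}.$$


Putting all this together, we have arrived at the component version of (7.5)


$$\delta_j^{(\ell)} = \theta'\!\left(s_j^{(\ell)}\right) \sum_{k=1}^{d^{(\ell+1)}} w_{jk}^{(\ell+1)} \delta_k^{(\ell+1)}.$$


Intuitively, the first term comes from the impact of $\mathbf{s}^{(\ell)}$ on $\mathbf{x}^{(\ell)}$; the summation
is the impact of $\mathbf{x}^{(\ell)}$ on $\mathbf{s}^{(\ell+1)}$, and the impact of $\mathbf{s}^{(\ell+1)}$ on h is what gives us
back the sensitivities in layer $(\ell + 1)$, resulting in the backward recursion.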


End safe skip: Those who skipped are now rejoining us to discuss how
backpropagation gives us $\nabla E_{\text{in}}$.





Backpropagation works with a data point (x, y) and weights $\mathbf{w} = \{\mathrm{W}^{(1)}, \ldots, \mathrm{W}^{(L)}\}$.
Since we run one forward and backward propagation to compute the outputs
$\mathbf{x}^{(\ell)}$ and the sensitivities $\boldsymbol{\delta}^{(\ell)}$, the running time is order of the number of
weights in the network. We compute once for each data point $(\mathbf{x}_n, y_n)$ to
get $\nabla E_{\text{in}}(\mathbf{x}_n)$ and, using the sum in (7.3), we aggregate these single point
gradients to get the full batch gradient $\nabla E_{\text{in}}$. We summarize the algorithm
below.


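Algorithm to Compute $E_{\text{in}}(\mathbf{w})$ and $\mathbf{g} = \nabla E_{\text{in}}(\mathbf{w})$.
Input: $\mathbf{w} = \{\mathrm{W}^{(1)}, \ldots, \mathrm{W}^{(L)}\}$; $\mathcal{D} = (\mathbf{x}_1, y_1) \ldots (\mathbf{x}_N, y_N)$.
Output: error $E_{\text{in}}(\mathbf{w})$ and gradient $\mathbf{g} = \{\mathrm{G}^{(1)}, \ldots, \mathrm{G}^{(L)}\}$.
1: Initialize: $E_{\text{in}} = 0$ and $\mathrm{G}^{(\ell)} = 0 \cdot \mathrm{W}^{(\ell)}$ for $\ell = 1, \ldots, L$.
2: for each data point $(\mathbf{x}_n, y_n)$, $n = 1, \ldots, N$, do
3:    Compute $\mathbf{x}^{(\ell)}$ for $\ell = 0, \ldots, L$.   [forward propagation]
4:    Compute $\boldsymbol{\delta}^{(\ell)}$ for $\ell = L, \ldots, 1$.   [backpropagation]
5:    $E_{\text{in}} \leftarrow E_{\text{in}} + \frac{1}{N} (\mathbf{x}^{(L)} - y_n)^2$.
6:    for $\ell = 1, \ldots, L$ do
7:       $\mathrm{G}^{(\ell)}(\mathbf{x}_n) = \left[\mathbf{x}^{(\ell-1)} (\boldsymbol{\delta}^{(\ell)})^T\right]$
8:       $\mathrm{G}^{(\ell)} \leftarrow \mathrm{G}^{(\ell)} + \frac{1}{N} \mathrm{G}^{(\ell)}(\mathbf{x}_n)$
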

($\mathrm{G}^{(\ell)}(\mathbf{x}_n)$ is the gradient on data point $\mathbf{x}_n$.) The weight update for a single
iteration of fixed learning rate gradient descent is $\mathrm{W}^{(\ell)} \leftarrow \mathrm{W}^{(\ell)} - \eta \mathrm{G}^{(\ell)}$, for
$\ell = 1, \ldots, L$. We do all this for one iteration of gradient descent, a costly
computation for just one little step.

In Chapter 3, we discussed stochastic gradient descent (SGD) as a more
efficient alternative to the batch mode. Rather than wait for the aggregate
gradient $\mathrm{G}^{(\ell)}$ at the end of the iteration, one immediately updates the weights
as each data point is sequentially processed using the single point gradient in
step 7 of the algorithm: $\mathrm{W}^{(\ell)} \leftarrow \mathrm{W}^{(\ell)} - \eta \mathrm{G}^{(\ell)}(\mathbf{x}_n)$. In this sequential version,
you still run a forward and backward propagation for each data point, but make
N updates to the weights. A comparison of batch gradient descent with SGD is
shown to the right. We used 500 training examples from the digits data and a
2-layer neural network with 5 hidden units and learning rate $\eta = 0.01$. The SGD
curve is erratic because one is not minimizing the total error at each
iteration, but the error on a specific data point. One method to dampen this
erratic behavior is to decrease the learning rate as the minimization proceeds.

[Figure: $\log_{10}(\text{error})$ versus $\log_{10}(\text{iteration})$ for gradient descent and SGD.]
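
The difference between the two update schedules is easiest to see on a tiny problem. The sketch below is not the digits experiment; it assumes a single sigmoidal unit, a hypothetical four-point data set, and a fixed learning rate of 0.1. Batch gradient descent makes one update per pass over the data, while SGD makes one update per data point.

```python
import math

# Hypothetical toy data; each x is already augmented with x0 = 1.
DATA = [((1.0, -2.0), -1.0), ((1.0, -1.0), -1.0),
        ((1.0, 1.0), 1.0), ((1.0, 2.0), 1.0)]

def h(w, x):
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))

def e_in(w):
    return sum((h(w, x) - y) ** 2 for x, y in DATA) / len(DATA)

def point_gradient(w, x, y):
    """Gradient of the squared error on one data point."""
    out = h(w, x)
    return [2.0 * (out - y) * (1.0 - out * out) * xi for xi in x]

def batch_gd(w, eta=0.1, iters=50):
    w = list(w)
    for _ in range(iters):
        g = [0.0] * len(w)
        for x, y in DATA:                      # aggregate over all points
            g = [gi + pi / len(DATA)
                 for gi, pi in zip(g, point_gradient(w, x, y))]
        w = [wi - eta * gi for wi, gi in zip(w, g)]   # one update per pass
    return w

def sgd(w, eta=0.1, epochs=50):
    w = list(w)
    for _ in range(epochs):
        for x, y in DATA:                      # one update per data point
            w = [wi - eta * gi
                 for wi, gi in zip(w, point_gradient(w, x, y))]
    return w
```

For the same number of passes over the data, `sgd` performs N times as many weight updates as `batch_gd`.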








The speed at which you minimize $E_{\text{in}}$ can depend heavily on the optimization
algorithm you use. SGD appears significantly better than plain vanilla
gradient descent, but we can do much better; even SGD is not very efficient.
In Section 7.5, we discuss some other powerful methods (for example, conjugate
gradients) that can significantly improve upon gradient descent and SGD,
by making more effective use of the gradient.


Initialization and Termination. Choosing the initial weights and deciding
when to stop the gradient descent can be tricky, as compared with logistic
regression, because the in-sample error is not convex anymore. From
Exercise 7.7, if the weights are initialized too large so that $\tanh(\mathbf{w}^T\mathbf{x}_n) \approx \pm 1$,
then the gradient will be close to zero and the algorithm won’t get anywhere.
This is especially a problem if you happen to initialize the weights
to the wrong sign. It is usually best to initialize the weights to small random
values where $\tanh(\mathbf{w}^T\mathbf{x}_n) \approx 0$ so that the algorithm has the flexibility to
move the weights easily to fit the data. One good choice is to initialize using
Gaussian random weights, $w_i \sim N(0, \sigma_w^2)$ where $\sigma_w^2$ is small. But how small
should $\sigma_w^2$ be? A simple heuristic is that we want $|\mathbf{w}^T\mathbf{x}_n|^2$ to be small. Since
$\mathbb{E}_{\mathbf{w}}\left[|\mathbf{w}^T\mathbf{x}_n|^2\right] = \sigma_w^2 \|\mathbf{x}_n\|^2$, we should choose $\sigma_w^2$ so that $\sigma_w^2 \cdot \max_n \|\mathbf{x}_n\|^2 \ll 1$.
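
The heuristic can be turned into a small helper. In this sketch, the parameter `scale` stands in for "much less than 1" and its default of 0.01 is an arbitrary assumption, as are the function names and the two-point data set.

```python
import math
import random

def init_sigma2(data, scale=0.01):
    """Choose sigma_w^2 so that sigma_w^2 * max_n ||x_n||^2 equals scale."""
    max_sq_norm = max(sum(v * v for v in x) for x in data)
    return scale / max_sq_norm

def init_weights(num_weights, sigma2, seed=0):
    """Draw Gaussian weights w_i ~ N(0, sigma_w^2) with a fixed seed."""
    rng = random.Random(seed)
    sigma = math.sqrt(sigma2)
    return [rng.gauss(0.0, sigma) for _ in range(num_weights)]

data = [(1.0, 3.0, 4.0), (1.0, 0.5, -0.5)]   # hypothetical inputs, x0 = 1
sigma2 = init_sigma2(data)
w = init_weights(3, sigma2)
```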

















Exercise 7.9 


What can go wrong if you just initialize all the weights to exactly zero? 

When do we stop? It is risky to rely solely on the size of the gradient to
stop. As illustrated on the right, you might stop prematurely when the
iteration reaches a relatively flat region (which is more common than you
might suspect). A combination of stopping criteria is best in practice, for
example stopping only when there is marginal error improvement coupled with
small error, plus an upper bound on the number of iterations.

[Figure: error versus weights, $\mathbf{w}$, with a relatively flat region.]
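One way to combine the stopping criteria just described is sketched below. The function name `should_stop` and the three thresholds are illustrative assumptions; suitable values depend on the problem.

```python
def should_stop(errors, t, max_iters=10000, err_tol=1e-3, improve_tol=1e-6):
    """Combined stopping rule.

    errors: in-sample errors recorded so far (errors[-1] is the current one).
    Stop when the iteration budget is exhausted, or when the error is small
    AND the latest improvement is marginal.
    """
    if t >= max_iters:                       # upper bound on iterations
        return True
    if len(errors) < 2:                      # not enough history yet
        return False
    improvement = errors[-2] - errors[-1]
    return errors[-1] <= err_tol and abs(improvement) <= improve_tol
```

Note that a flat stretch with a large error does not trigger the rule, which is the failure mode the paragraph warns about.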





7.2.4 Regression for Classification 


In Chapter 3, we mentioned that you could use the weights resulting from 
linear regression as perceptron weights for classification, and you can do the 
same with neural networks. Specifically, fit the classification data ($y_n = \pm 1$) as
though it were a regression problem. This means you use the identity function
as the output node transformation, instead of $\tanh(\cdot)$. This can be a great
help because of the ‘flat regions’ which the network is susceptible to when 
using gradient descent, which happens often in training. The reason for these 
flat periods in the optimization is the exceptionally flat nature of the tanh 
function when its argument gets large. If for whatever reason the weights get 
large toward the beginning of the training, then the error surface begins to look 
flat, because the tanh has been saturated. Now, gradient descent cannot make 
any progress and you might think you are at a minimum, when in fact you 
are far from a minimum. The problem of a flat error surface is considerably 
mitigated when the output transformation is the identity because you can 
recover from an initial bad move if it happens to take you to large weights 
(the linear output never saturates). For a concrete example of a prematurely 
flat in-sample error, see the figure in Example 7.2 on page 25. 





7.3 Approximation versus Generalization 


A large enough MLP with 2 hidden layers can approximate smooth decision
functions arbitrarily well. It turns out that a single hidden layer
suffices.⁶ A neural network with a single hidden layer having $m$ hidden
units ($d^{(1)} = m$) implements a function of the form

$$h(\mathbf{x}) = \theta\left( w_{01}^{(2)} + \sum_{j=1}^{m} w_{j1}^{(2)}\, \theta\left( \sum_{i=0}^{d} w_{ij}^{(1)} x_i \right) \right).$$

6 Though one hidden layer is enough, it is not necessarily the most efficient way to fit the
data; for example a much smaller 2-hidden-layer network may exist.

This is a cumbersome representation for such a simple network. A simplified
notation for this special case is much more convenient. For the second-layer
weights, we will just use $w_0, w_1, \ldots, w_m$ and we will use $\mathbf{v}_j$ to denote the $j$th
column of the first layer weight matrix $\mathrm{W}^{(1)}$, for $j = 1, \ldots, m$. With this simpler
notation, the hypothesis becomes much more pleasant looking:

$$h(\mathbf{x}) = \theta\left( w_0 + \sum_{j=1}^{m} w_j\, \theta\left(\mathbf{v}_j^{\mathrm{T}}\mathbf{x}\right) \right).$$
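The simplified hypothesis can be evaluated directly. The sketch below is a minimal illustration; the function name `mlp_single_hidden` and the list-based representation of the weights are assumptions made here for readability.

```python
import math

def mlp_single_hidden(x, V, w, theta=math.tanh):
    """Evaluate h(x) = theta(w[0] + sum_j w[j] * theta(v_j . x)).

    x: input vector with x[0] = 1 (bias coordinate).
    V: list of the m first-layer weight vectors v_1, ..., v_m.
    w: second-layer weights [w_0, w_1, ..., w_m].
    """
    hidden = [theta(sum(vi * xi for vi, xi in zip(v, x))) for v in V]
    signal = w[0] + sum(wj * hj for wj, hj in zip(w[1:], hidden))
    return theta(signal)
```

Passing `theta=lambda s: s` gives a purely linear network, which is a convenient sanity check.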


Neural Network versus Nonlinear Transforms. Recall the linear model 
from Chapter 3, with nonlinear transform ®(x) that transforms x to z: 


$$\mathbf{x} \to \mathbf{z} = \Phi(\mathbf{x}) = [1, \phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \ldots, \phi_M(\mathbf{x})]^{\mathrm{T}}.$$


The linear model with nonlinear transform is a hypothesis of the form 


$$h(\mathbf{x}) = \theta\left( w_0 + \sum_{j=1}^{M} w_j\, \phi_j(\mathbf{x}) \right).$$


The $\phi_j(\cdot)$ are called basis functions. On face value, the neural network and the
linear model look nearly identical, by setting $\theta(\mathbf{v}_j^{\mathrm{T}}\mathbf{x}) = \phi_j(\mathbf{x})$. There is a subtle
difference, though, and this difference has a big practical impact. With the
nonlinear transform, the basis functions $\phi_j(\cdot)$ are fixed ahead of time before
you look at the data. With the neural network, the ‘basis function’ $\theta(\mathbf{v}_j^{\mathrm{T}}\mathbf{x})$
has a parameter $\mathbf{v}_j$ inside, and we can tune $\mathbf{v}_j$ after seeing the data. First,
this has a computational impact because the parameter $\mathbf{v}_j$ appears inside
the nonlinearity $\theta(\cdot)$; the model is no longer linear in its parameters. We
saw a similar effect with the centers of the radial basis function network in
Chapter 6. Models which are nonlinear in their parameters pose a significant
computational challenge when it comes to fitting to data. Second, it means
that we can tune the basis functions to the data. Tunable basis functions,
although computationally harder to fit to data, do give us considerably more
flexibility to fit the data than do fixed basis functions. With $m$ tunable basis
functions one has roughly the same approximation power to fit the data as with
$m^d$ fixed basis functions. For large $d$, tunable basis functions have considerably
more power.


Exercise 7.10 


It is no surprise that adding nodes in the hidden layer gives the neural net- 
work more approximation ability, because you are adding more parameters. 


How many weight parameters are there in a neural network with architecture 
specified by $\mathbf{d} = [d^{(0)}, d^{(1)}, \ldots, d^{(L)}]$, a vector giving the number of nodes
in each layer? Evaluate your formula for a 2 hidden layer network with 10 
hidden nodes in each hidden layer. 

Approximation Capability of the Neural Network. It is possible to
quantify how the approximation ability of the neural network grows as you
increase $m$, the number of hidden units. Such results fall into the field known
as functional approximation theory, a field which, in the context of neural
networks, has produced some interesting results. Usually one starts by making
some assumption about the smoothness (or complexity) of the target
function $f$. On the theoretical side, you have lost some generality as compared
with, for example, the VC-analysis. However, in practice, such assumptions
are OK because target functions are often smooth. If you assume that the
data are generated by a target function with complexity⁷ at most $C_f$, then
a variety of bounds exist on how small an in-sample error is achievable with
$m$ hidden units. For regression with squared error, one can achieve in-sample
error

$$E_{\text{in}}(h) = \frac{1}{N}\sum_{n=1}^{N} \left(h(\mathbf{x}_n) - y_n\right)^2 \le \frac{(2RC_f)^2}{m},$$

where $R = \max_n \|\mathbf{x}_n\|$ is the ‘radius’ of the data. The in-sample error decreases
inversely with the number of hidden units. For classification, a similar result
with slightly worse dependence on $m$ exists. With high probability,

$$E_{\text{in}} \le E_{\text{out}}^{*} + O(1/\sqrt{m}),$$

where $E_{\text{out}}^{*}$ is the out-of-sample error of the optimal classifier that we discussed
in Chapter 6. The message is that $E_{\text{in}}$ can be made small by choosing a large
enough hidden layer.


Generalization and the VC-Dimension. For sufficiently large $m$, we can
get $E_{\text{in}}$ to be small, so what remains is to ensure that $E_{\text{in}} \approx E_{\text{out}}$. We need to
look at the VC-dimension. For the two layer hard-threshold neural network
(MLP) where $\theta(x) = \text{sign}(x)$, we show a simple bound on the VC dimension:

$$d_{\text{vc}} \le (\text{const}) \cdot md\log(md). \qquad (7.7)$$

For a general sigmoid neural network, $d_{\text{vc}}$ can be infinite. For the $\tanh(\cdot)$
sigmoid, with $\text{sign}(\cdot)$ output node, one can show that $d_{\text{vc}} = O(VQ)$ where $V$
is the number of hidden nodes and $Q$ is the number of weights; for the two
layer case

$$d_{\text{vc}} = O(md(m + d)).$$

The $\tanh(\cdot)$ network has higher VC-dimension than the 2-layer MLP, which
is not surprising because $\tanh(\cdot)$ can approximate $\text{sign}(\cdot)$ by choosing large
enough weights. So every dichotomy that can be implemented by the MLP
can also be implemented by the $\tanh(\cdot)$ neural network.

7 We do not describe details of how the complexity of a target can be measured. One
measure is the size of the ‘high frequency’ components of $f$ in its Fourier transform. Another
more restrictive measure is the number of bounded derivatives $f$ has.

To derive (7.7), we will actually show a more general result. Consider the
hypothesis set illustrated by the network on the right. Hidden node $i$ in the
hidden layer implements a function $h_i \in \mathcal{H}_i$ which maps $\mathbb{R}^d$ to $\{+1, -1\}$;
the output node implements a function $h_c \in \mathcal{H}_C$ which maps $\mathbb{R}^m$ to
$\{+1, -1\}$. This output node combines the outputs of the hidden layer nodes to
implement the hypothesis

$$h(\mathbf{x}) = h_c(h_1(\mathbf{x}), \ldots, h_m(\mathbf{x})).$$

(For the 2-layer MLP, all the hypothesis sets are perceptrons.)

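Suppose the VC-dimension of $\mathcal{H}_i$ is $d_i$ and the VC-dimension of $\mathcal{H}_C$ is $d_c$.
Fix $\mathbf{x}_1, \ldots, \mathbf{x}_N$, and the hypotheses $h_1, \ldots, h_m$ implemented by the hidden
nodes. The hypotheses $h_1, \ldots, h_m$ are now fixed basis functions defining a
transform to $\mathbb{R}^m$,

$$\mathbf{x}_1 \to \mathbf{z}_1 = \begin{bmatrix} h_1(\mathbf{x}_1) \\ \vdots \\ h_m(\mathbf{x}_1) \end{bmatrix}, \quad \cdots, \quad \mathbf{x}_N \to \mathbf{z}_N = \begin{bmatrix} h_1(\mathbf{x}_N) \\ \vdots \\ h_m(\mathbf{x}_N) \end{bmatrix}.$$

The transformed points are binary vectors in $\mathbb{R}^m$. Given $h_1, \ldots, h_m$, the points
$\mathbf{x}_1, \ldots, \mathbf{x}_N$ are transformed to an arrangement of points $\mathbf{z}_1, \ldots, \mathbf{z}_N$ in $\mathbb{R}^m$.
Using our flexibility to choose $h_1, \ldots, h_m$, we now upper bound the number
of possible different arrangements $\mathbf{z}_1, \ldots, \mathbf{z}_N$ we can get.

The first components of all the $\mathbf{z}_n$ are given by $h_1(\mathbf{x}_1), \ldots, h_1(\mathbf{x}_N)$, which
is a dichotomy of $\mathbf{x}_1, \ldots, \mathbf{x}_N$ implemented by $h_1$. Since the VC-dimension
of $\mathcal{H}_1$ is $d_1$, there are at most $N^{d_1}$ such dichotomies.⁸ That is, there are at
most $N^{d_1}$ different ways of choosing assignments to all the first components
of the $\mathbf{z}_n$. Similarly, an assignment to all the $i$th components can be chosen
in at most $N^{d_i}$ ways. Thus, the total number of possible arrangements for
$\mathbf{z}_1, \ldots, \mathbf{z}_N$ is at most

$$\prod_{i=1}^{m} N^{d_i} = N^{\sum_{i=1}^{m} d_i}.$$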


Each of these arrangements can be dichotomized in at most $N^{d_c}$ ways, since the
VC-dimension of $\mathcal{H}_C$ is $d_c$. Each such dichotomy for a particular arrangement
gives one dichotomy of the data $\mathbf{x}_1, \ldots, \mathbf{x}_N$. Thus, the maximum number of
different dichotomies we can implement on $\mathbf{x}_1, \ldots, \mathbf{x}_N$ is upper bounded by
the product: the number of possible arrangements times the number of ways
of dichotomizing a particular arrangement. We have shown that

$$m(N) \le N^{d_c} \cdot N^{\sum_{i=1}^{m} d_i} = N^{d_c + \sum_{i=1}^{m} d_i}.$$

Let $D = d_c + \sum_{i=1}^{m} d_i$. After some algebra (left to the reader), if
$N \ge 2D\log_2 D$, then $m(N) < 2^N$, from which we conclude that $d_{\text{vc}} \le 2D\log_2 D$.
For the 2-layer MLP, $d_i = d + 1$ and $d_c = m + 1$, and so we have that
$D = d_c + \sum_i d_i = m(d + 2) + 1 = O(md)$. Thus, $d_{\text{vc}} = O(md\log(md))$.
Our analysis looks very crude, but it is almost tight: it is possible to shatter
$\Omega(md)$ points with $m$ hidden units (see Problem 7.16), and so the upper bound
can be loose by at most a logarithmic factor. Using the VC-dimension, the
generalization error bar from Chapter 2 is $O(\sqrt{(d_{\text{vc}}\log N)/N})$, which for the
2-layer MLP is $O(\sqrt{(md\log(md)\log N)/N})$.

8 Recall that for any hypothesis set with VC-dimension $d_{\text{vc}}$ and any $N > d_{\text{vc}}$, $m(N)$ (the
maximum number of implementable dichotomies) is bounded by $(eN/d_{\text{vc}})^{d_{\text{vc}}} \le N^{d_{\text{vc}}}$ (for
the sake of simplicity we assume that $d_{\text{vc}} > 2$).
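To make the bound $d_{\text{vc}} \le 2D\log_2 D$ concrete, the sketch below evaluates it for a 2-layer MLP and checks the key inequality $N^D < 2^N$ by comparing logarithms. The function names and the example values $m = 10$, $d = 2$ are illustrative assumptions.

```python
import math

def two_layer_mlp_vc_bound(m, d):
    """Return (D, 2*D*log2(D)) for a 2-layer MLP with m hidden units and
    d inputs, using d_i = d + 1 for each hidden node and d_c = m + 1."""
    D = m * (d + 2) + 1
    return D, 2 * D * math.log2(D)

def growth_bound_below_2N(N, D):
    """Check N^D < 2^N via logarithms: D * log2(N) < N."""
    return D * math.log2(N) < N

D, bound = two_layer_mlp_vc_bound(m=10, d=2)   # D = 41, bound is about 439.3
ok = growth_bound_below_2N(math.ceil(bound), D)
```

For these values the check succeeds, so at $N = 440$ the number of dichotomies is already below $2^N$ and 440 points cannot be shattered.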

We will get good generalization if $m$ is not too large and we can fit the data
if $m$ is large enough. A balance is called for. For example, choosing $m$ to grow
in proportion to $\sqrt{N}$, as $N \to \infty$, $E_{\text{out}} \to E_{\text{in}}$ and
$E_{\text{in}} \to E_{\text{out}}^{*}$. That is, $E_{\text{out}} \to E_{\text{out}}^{*}$ (the optimal
performance) as $N$ grows, and $m$ grows sub-linearly with $N$. In practice
the ‘asymptotic’ regime is a luxury and one does not simply set $m \approx \sqrt{N}$.
These theoretical results are a good guideline, but the best out-of-sample 
performance usually results when you control overfitting using validation (to 
select the number of hidden units) and regularization to prevent overfitting. 

We conclude with a note on where neural networks sit in the parametric- 
nonparametric debate. There are explicit parameters to be learned, so para- 
metric seems right. But distinctive features of nonparametric models also 
stand out: the neural network is generic and flexible and can realize optimal 
performance when N grows. Neither parametric nor nonparametric captures 
the whole story. We choose to label neural networks as semi-parametric. 


7.4 Regularization and Validation 


The multi-layer neural network is powerful, and, coupled with gradient descent
(a good algorithm to minimize $E_{\text{in}}$), we have a recipe for overfitting. We
discuss some practical techniques to help.


7.4.1 Weight Based Complexity Penalties 


As with linear models, one can regularize the learning using a complexity
penalty by minimizing an augmented error (penalized in-sample error). The
squared weight decay regularizer is popular, having augmented error:

$$E_{\text{aug}}(\mathbf{w}) = E_{\text{in}}(\mathbf{w}) + \frac{\lambda}{N} \sum_{\ell,i,j} \left(w_{ij}^{(\ell)}\right)^2.$$

The regularization parameter $\lambda$ is selected via validation, as discussed in
Chapter 4. To apply gradient descent, we need $\nabla E_{\text{aug}}(\mathbf{w})$. The penalty term adds
to the gradient a term proportional to weights,

$$\frac{\partial E_{\text{aug}}(\mathbf{w})}{\partial \mathrm{W}^{(\ell)}} = \frac{\partial E_{\text{in}}(\mathbf{w})}{\partial \mathrm{W}^{(\ell)}} + \frac{2\lambda}{N}\mathrm{W}^{(\ell)}.$$

We know how to obtain $\partial E_{\text{in}}/\partial \mathrm{W}^{(\ell)}$ using backpropagation. The penalty term
adds a component to the weight update that is in the negative direction of $\mathbf{w}$,
i.e. towards zero weights; hence the term weight decay.
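The effect of the extra gradient term can be seen in a one-line update. In this sketch the weights are flattened into a plain list and `grad_ein` stands in for the backpropagation gradient; both are simplifying assumptions made for illustration.

```python
def weight_decay_step(w, grad_ein, eta, lam, N):
    """One gradient-descent step on E_aug = E_in + (lam/N) * sum(w_i^2).

    The gradient of E_aug is grad E_in + (2*lam/N) * w, so each weight is
    pulled toward zero in addition to the usual move against grad E_in.
    """
    return [wi - eta * (gi + 2.0 * lam / N * wi) for wi, gi in zip(w, grad_ein)]

# With a zero in-sample gradient, every weight shrinks by the factor 1 - 2*eta*lam/N.
w_next = weight_decay_step([1.0, -2.0], [0.0, 0.0], eta=0.1, lam=0.5, N=10)
```

In this example the shrink factor is 0.99, so the weights decay geometrically toward zero whenever the data exert no pull.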
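Another similar regularizer is weight elimination, having augmented error:

$$E_{\text{aug}}(\mathbf{w}, \lambda) = E_{\text{in}}(\mathbf{w}) + \frac{\lambda}{N} \sum_{\ell,i,j} \frac{\left(w_{ij}^{(\ell)}\right)^2}{1 + \left(w_{ij}^{(\ell)}\right)^2}.$$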


For a small weight, the penalty term is much like weight decay, and will decay 
that weight to zero. For a large weight, the penalty term is approximately 
a constant, and contributes little to the gradient. Small weights decay faster 
than large weights, and the effect is to ‘eliminate’ those smaller weights. 
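The behaviour of the weight elimination penalty for small and large weights can be checked numerically. The sketch below looks at the per-weight penalty $w^2/(1 + w^2)$ only (the factor $\lambda/N$ is left out); the function names are assumptions.

```python
def elimination_penalty(w):
    """Per-weight penalty w^2 / (1 + w^2)."""
    return w * w / (1.0 + w * w)

def elimination_penalty_grad(w):
    """Derivative of the penalty: 2w / (1 + w^2)^2."""
    return 2.0 * w / (1.0 + w * w) ** 2

def relative_shrink_rate(w):
    """Gradient divided by the weight: how strongly the penalty pulls,
    relative to the size of the weight."""
    return elimination_penalty_grad(w) / w

small, large = 0.01, 100.0
# Small weight: gradient is close to 2w, as in weight decay.
# Large weight: penalty is nearly constant, so the gradient is nearly zero.
rates = (relative_shrink_rate(small), relative_shrink_rate(large))
```

The relative rate is close to 2 for the small weight and essentially zero for the large one, which is the sense in which small weights are ‘eliminated’ first.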


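Exercise 7.11

For weight elimination, show that

$$\frac{\partial E_{\text{aug}}}{\partial w_{ij}^{(\ell)}} = \frac{\partial E_{\text{in}}}{\partial w_{ij}^{(\ell)}} + \frac{2\lambda}{N} \cdot \frac{w_{ij}^{(\ell)}}{\left(1 + \left(w_{ij}^{(\ell)}\right)^2\right)^2}.$$

Argue that weight elimination shrinks small weights faster than large ones.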


7.4.2 Early Stopping 


Another method for regularization, which on face value does not look like
regularization, is early stopping. An iterative method such as gradient descent
does not explore your full hypothesis set all at once. With more iterations, more
of your hypothesis set is explored. This means that by using fewer iterations,
you explore a smaller hypothesis set and should get better generalization.⁹

Consider fixed-step gradient descent with step size $\eta$. At the first step, we
start at weights $\mathbf{w}_0$, and take a step of size $\eta$ to
$\mathbf{w}_1 = \mathbf{w}_0 - \eta\,\mathbf{g}_0/\|\mathbf{g}_0\|$, where $\mathbf{g}_0 = \nabla E_{\text{in}}(\mathbf{w}_0)$.
Because we have taken a step in the direction of the negative gradient, we have
‘looked at’ all the hypotheses in the shaded region shown on the right.
This is because a step in the negative gradient leads to the sharpest decrease in
$E_{\text{in}}(\mathbf{w})$, and so $\mathbf{w}_1$ minimizes $E_{\text{in}}(\mathbf{w})$ among all weights with
$\|\mathbf{w} - \mathbf{w}_0\| \le \eta$. We indirectly searched the entire hypothesis set

$$\mathcal{H}_1 = \{\mathbf{w} : \|\mathbf{w} - \mathbf{w}_0\| \le \eta\},$$

and picked the hypothesis $\mathbf{w}_1 \in \mathcal{H}_1$ with minimum in-sample error.


9 If we are to be sticklers for correctness, the hypothesis set explored could depend on the
data set and so we cannot directly apply the VC analysis which requires the hypothesis set 
to be fixed ahead of time. Since we are just illustrating the main idea, we will brush such 
technicalities under the rug. 





© M Abu-Mostafa, Magdon-Ismail, Lin: Jan-2015 e-Chap:7—24 













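Now consider the second step, as illustrated to the right, which moves to
$\mathbf{w}_2$. We indirectly explored the hypothesis set of weights with
$\|\mathbf{w} - \mathbf{w}_1\| \le \eta$, picking the best. Since $\mathbf{w}_1$ was already the
minimizer of $E_{\text{in}}$ over $\mathcal{H}_1$, this means that $\mathbf{w}_2$ is the minimizer of
$E_{\text{in}}$ among all hypotheses in $\mathcal{H}_2$, where

$$\mathcal{H}_2 = \mathcal{H}_1 \cup \{\mathbf{w} : \|\mathbf{w} - \mathbf{w}_1\| \le \eta\}.$$

Note that $\mathcal{H}_1 \subset \mathcal{H}_2$. Similarly, we define hypothesis set

$$\mathcal{H}_3 = \mathcal{H}_2 \cup \{\mathbf{w} : \|\mathbf{w} - \mathbf{w}_2\| \le \eta\},$$

and in the 3rd iteration, we pick weights $\mathbf{w}_3$ that minimize $E_{\text{in}}$ over
$\mathbf{w} \in \mathcal{H}_3$. We can continue this argument as gradient descent proceeds, and
define a nested sequence of hypothesis sets

$$\mathcal{H}_1 \subset \mathcal{H}_2 \subset \mathcal{H}_3 \subset \mathcal{H}_4 \subset \cdots.$$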


As $t$ increases, $E_{\text{in}}(\mathbf{w}_t)$ is decreasing, and $d_{\text{vc}}(\mathcal{H}_t)$ is increasing. So, we
expect to see the approximation-generalization trade-off which was illustrated in
Figure 2.3 (reproduced here with iteration $t$ a proxy for $d_{\text{vc}}$):


[Figure: Error versus iteration, $t$.]


The figure suggests it may be better to stop early at some t*, well before 
reaching a minimum of Ein. Indeed, this picture is observed in practice. 


Example 7.2. We revisit the digits task of classifying ‘1’ versus all other 
digits, with 70 randomly selected data points and a small sigmoidal neural 
network with a single hidden unit and $\tanh(\cdot)$ output node. The figure below
shows the in-sample error and test error versus iteration number.

[Figure: in-sample error and test error versus iteration, $t$ (logarithmic axis).]


The curves reinforce our theoretical discussion: the test error initially decreases 
as the approximation gain overcomes the worse generalization error bar; then, 
the test error increases as the generalization error bar begins to dominate the 
approximation gain, and overfitting becomes a serious problem. 














In the previous example, despite using a parsimonious neural network with 
just a single hidden node, overfitting was an issue because the data are noisy 
and the target function is complex, so both stochastic and deterministic noise 
are significant. We need to regularize. 

In the example, it is better to stop early at $t^*$ and constrain the learning
to the smaller hypothesis set $\mathcal{H}_{t^*}$. In this sense, early stopping is a form
of regularization. Early stopping is related to weight decay, as illustrated
to the right. You initialize $\mathbf{w}_0$ near zero; if you stop early at $\mathbf{w}_{t^*}$ you
have stopped at weights closer to $\mathbf{w}_0$, i.e., smaller weights. Early stopping
indirectly achieves smaller weights, which is what weight decay directly
achieves. To determine when to stop training, use a validation set to monitor
the validation error at iteration $t$ as you minimize the training-set error.
Report the weights $\mathbf{w}_{t^*}$ that have minimum validation error when you are
done training.

[Figure: contour of constant $E_{\text{in}}$, with the initial weights $\mathbf{w}_0$ marked.]
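The validation-based stopping rule can be sketched as follows. The toy training gradient and validation error below are hypothetical stand-ins chosen so the behaviour is easy to follow; in practice they would come from a network and real data sets.

```python
def train_with_early_stopping(w0, grad_train, val_error, eta, num_iters):
    """Run gradient descent on the training error and remember the iterate
    with the smallest validation error. Returns (w_tstar, tstar, E_val)."""
    w = list(w0)
    best_w, best_t, best_val = list(w), 0, val_error(w)
    for t in range(1, num_iters + 1):
        g = grad_train(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
        v = val_error(w)
        if v < best_val:                 # new minimum of the validation error
            best_w, best_t, best_val = list(w), t, v
    return best_w, best_t, best_val

# Toy stand-ins (assumed): training error (w - 2)^2, validation error (w - 1)^2.
grad_train = lambda w: [2.0 * (w[0] - 2.0)]
val_error = lambda w: (w[0] - 1.0) ** 2
w_star, t_star, e_val = train_with_early_stopping(
    [0.0], grad_train, val_error, eta=0.1, num_iters=50)
```

Training continues for all 50 iterations, but the weights reported are those from the iteration with the lowest validation error, which here is well before the training error reaches its minimum.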

After selecting $t^*$, it is tempting to use all the data to train for $t^*$ iterations.
Unfortunately, adding back the validation data and training for $t^*$ iterations
can lead to a completely different set of weights. The validation estimate of
performance only holds for $\mathbf{w}_{t^*}$ (the weights you should output). This appears
to go against the wisdom of the decreasing learning curve from Chapter 4: if
you learn with more data, you get a better final hypothesis.¹⁰

10 Using all the data to train to an in-sample error of $E_{\text{train}}(\mathbf{w}_{t^*})$ is also not recommended.
Further, an in-sample error of $E_{\text{train}}(\mathbf{w}_{t^*})$ may not even be achievable with all the data.

Exercise 7.12


Why does outputting $\mathbf{w}_{t^*}$ rather than training with all the data for $t^*$
iterations not go against the wisdom that learning with more data is better?
[Hint: “More data is better” applies to a fixed model $(\mathcal{H}, \mathcal{A})$. Early
stopping is model selection on nested hypothesis sets
$\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots$ determined by $\mathcal{D}_{\text{train}}$. What happens if you
were to use the full data $\mathcal{D}$?]


When using early stopping, the usual trade-off exists for choosing the size of 
the validation set: too large and there is little data to train on; too small 
and the validation error will not be reliable. A rule of thumb is to set aside a
fraction of the data (one-tenth to one-fifth) for validation. 


Exercise 7.13 


Suppose you run gradient descent for 1000 iterations. You have 500 examples
in $\mathcal{D}$, and you use 450 for $\mathcal{D}_{\text{train}}$ and 50 for $\mathcal{D}_{\text{val}}$. You output the
weight from iteration 50, with $E_{\text{val}}(\mathbf{w}_{50}) = 0.05$ and $E_{\text{train}}(\mathbf{w}_{50}) = 0.04$.

(a) Is $E_{\text{val}}(\mathbf{w}_{50}) = 0.05$ an unbiased estimate of $E_{\text{out}}(\mathbf{w}_{50})$?

(b) Use the Hoeffding bound to get a bound for $E_{\text{out}}$ using $E_{\text{val}}$ plus an
error bar. Your bound should hold with probability at least 0.1.

(c) Can you bound $E_{\text{out}}$ using $E_{\text{train}}$ or do you need more information?


Example 7.2 also illustrates another common problem with the sigmoidal output
node: gradient descent often hits a flat region where $E_{\text{in}}$ decreases very
little.¹¹ You might stop training, thinking you found a local minimum. This
‘early stopping’ by mistake is sometimes called the ‘self-regularizing’ property 
of sigmoidal neural networks. Accidental regularization due to misinterpreted 
convergence is unreliable. Validation is much better. 


7.4.3 Experiments With Digits Data 


Let’s put theory to practice on the digits task (to classify ‘1’ versus all other 
digits). We learn on 500 randomly chosen data points using a sigmoidal neural 
network with one hidden layer and 10 hidden nodes. There are 41 weights 
(tunable parameters), so more than 10 examples per degree of freedom, which 
is quite reasonable. We use identity output transformation $\theta(s) = s$ to reduce
the possibility of getting stuck at a flat region of the error surface. At the
end of training, we use the output transformation $\theta(s) = \text{sign}(s)$ for actually
classifying data. After more than 2 million iterations of gradient descent, we 
manage to get close to a local minimum. The result is shown in Figure 7.2. 
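The count of 41 weights quoted above can be reproduced with a short sketch. It assumes two input features (the two axes of the digits figures) and one bias weight per node; the function name `count_weights` is illustrative.

```python
def count_weights(d):
    """Number of weights for layer sizes d = [d0, d1, ..., dL].

    Each node in layer l receives d[l-1] inputs plus one bias weight.
    """
    return sum(d[l] * (d[l - 1] + 1) for l in range(1, len(d)))

# 2 inputs, 10 hidden nodes, 1 output: 10*(2+1) + 1*(10+1) weights.
num_weights = count_weights([2, 10, 1])
```

With 500 data points and 41 weights, this gives the ratio of roughly 12 examples per parameter mentioned in the text.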

It doesn’t take a genius to see the overfitting. Figure 7.2 attests to the ap- 
proximation capabilities of a moderately sized neural network. Let’s try weight 


11 The linear output transformation function helps avoid such excessively flat regions. 

Figure 7.2: 10 hidden unit neural network trained with gradient descent on
500 examples from the digits data (no regularization). Blue circles are the
digit ‘1’ and the red x’s are the other digits. Overfitting is rampant.
(Axes: Average Intensity and Symmetry.)


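decay to fight the overfitting. We minimize
$E_{\text{aug}}(\mathbf{w}, \lambda) = E_{\text{in}}(\mathbf{w}) + \frac{\lambda}{N}\mathbf{w}^{\mathrm{T}}\mathbf{w}$,
with $\lambda = 0.01$. We get a much more believable separator, shown below.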





[Figure: separator obtained with weight decay. Axes: Average Intensity and Symmetry.]


As a final illustration, let’s try early stopping with a validation set of size 50 
(one-tenth of the data); so the training set will now have size 450. 

The training dynamics of gradient descent are shown in Figure 7.3(a). The 
linear output transformation function has helped as there are no extremely flat 
periods in the training error. The classification boundary with early stopping 
at t* is shown in Figure 7.3(b). The result is similar to weight decay. In both 
cases, the regularized classification boundary is more believable. Ultimately, 
the quantitative statistics are what matters, and these are summarized below. 

Figure 7.3: Early stopping with 500 examples from the digits data. (a)
Training dynamics: training and validation errors versus iteration, $t$, for
gradient descent with a training set of size 450 and validation set of size 50.
(b) Final hypothesis: the ‘regularized’ final hypothesis obtained by early
stopping at $t^*$, the minimum validation error (axes: Average Intensity and
Symmetry).


                   | Etrain | Eval | Ein  | Eout
No Regularization  |   -    |  -   | 0.2% | 3.1%
Weight Decay       |   -    |  -   | 1.0% | 2.1%
Early Stopping     |  1.1%  | 2.0% | 1.2% | 2.0%


7.5 Beefing Up Gradient Descent 


Gradient descent is a simple method to minimize $E_{\text{in}}$ that has problems
converging, especially with flat error surfaces. One solution is to minimize a
friendlier error instead, which is why training with a linear output node helps.
Rather than change the error measure, there is plenty of room to improve
the algorithm itself. Gradient descent takes a step of size $\eta$ in the negative
gradient direction. How should we determine $\eta$ and is the negative gradient
the best direction in which to move?


Exercise 7.14 


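Consider the error function $E(\mathbf{w}) = (\mathbf{w} - \mathbf{w}^*)^{\mathrm{T}}\mathrm{Q}(\mathbf{w} - \mathbf{w}^*)$, where $\mathrm{Q}$ is an
arbitrary positive definite matrix. Set $\mathbf{w} = \mathbf{0}$.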


Show that the gradient $\nabla E(\mathbf{w}) = -2\mathrm{Q}\mathbf{w}^*$. What weights minimize $E(\mathbf{w})$?
Does gradient descent move you in the direction of these optimal weights? 


Reconcile your answer with the claim in Chapter 3 that the gradient is the 
best direction in which to take a step. [Hint: How big was the step?] 


The previous exercise shows that the negative gradient is not necessarily the 
best direction for a large step, and we would like to take larger steps to increase 
the efficiency of the optimization. The next figure shows two algorithms: our
old friend gradient descent and our soon-to-be friend conjugate gradient
descent. Both algorithms are minimizing $E_{\text{in}}$ for a 5 hidden unit neural network
fitting 200 digits data. The performance difference is dramatic.


[Figure: gradient descent compared with conjugate gradient descent; horizontal axis: optimization time (sec), logarithmic scale.]


We now discuss methods for ‘beefing up’ gradient descent, but only scratch 
the surface of this important topic known as numerical optimization. The two 
main steps in an iterative optimization procedure are to determine: 

1. Which direction should one search for a local optimum? 


2. How large a step should one take in that direction? 


7.5.1 Choosing the Learning Rate $\eta$


In gradient descent, the learning rate $\eta$ multiplies the negative gradient to
give the move $-\eta\nabla E_{\text{in}}$. The size of the step taken is proportional to $\eta$. The
optimal step size (and hence learning rate $\eta$) depends on how wide or narrow
the error surface is near the minimum.

















[Figure: two error surfaces, $E_{\text{in}}(\mathbf{w})$ versus weights, $\mathbf{w}$. Left panel, wide: use large $\eta$. Right panel, narrow: use small $\eta$.]


When the surface is wider, we can take larger steps without overshooting;
since $\|\nabla E_{\text{in}}\|$ is small, we need a large $\eta$. Since we do not know ahead of time
how wide the surface is, it is easy to choose an inefficient value for $\eta$.

Variable learning rate gradient descent. A simple heuristic that adapts
the learning rate to the error surface works well in practice. If the error drops,
increase $\eta$; if not, the step was too large, so reject the update and decrease $\eta$.
For little extra effort, we get a significant boost to gradient descent.


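Variable Learning Rate Gradient Descent:

1: Initialize $\mathbf{w}(0)$, and $\eta_0$ at $t = 0$. Set $\alpha > 1$ and $\beta < 1$.
2: while stopping criterion has not been met do
3:     Let $\mathbf{g}(t) = \nabla E_{\text{in}}(\mathbf{w}(t))$, and set $\mathbf{v}(t) = -\mathbf{g}(t)$.
4:     if $E_{\text{in}}(\mathbf{w}(t) + \eta_t\mathbf{v}(t)) < E_{\text{in}}(\mathbf{w}(t))$ then
5:         accept: $\mathbf{w}(t + 1) = \mathbf{w}(t) + \eta_t\mathbf{v}(t)$; $\eta_{t+1} = \alpha\eta_t$.
6:     else
7:         reject: $\mathbf{w}(t + 1) = \mathbf{w}(t)$; $\eta_{t+1} = \beta\eta_t$.
8:     Iterate to the next step, $t = t + 1$.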





It is usually best to go with a conservative increment parameter, for example
$\alpha \approx 1.05$ to $1.1$, and a bit more aggressive decrement parameter, for example
$\beta \approx 0.5$ to $0.8$. This is because, if the error doesn’t drop, then one is in an
unusual situation and more drastic action is called for.
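The algorithm box above translates almost line for line into Python. The sketch below is illustrative: the stopping rule (small gradient or an iteration cap) and the toy quadratic error are assumptions, and `E` and `grad` are caller-supplied functions.

```python
def variable_lr_gd(E, grad, w0, eta0=0.1, alpha=1.1, beta=0.8,
                   max_iters=2000, tol=1e-12):
    """Gradient descent with an adaptive learning rate.

    A step is accepted (and eta multiplied by alpha > 1) only if it lowers
    the error; otherwise it is rejected and eta is multiplied by beta < 1.
    """
    w, eta = list(w0), eta0
    for _ in range(max_iters):
        g = grad(w)
        if sum(gi * gi for gi in g) < tol:   # assumed stopping criterion
            break
        v = [-gi for gi in g]
        w_new = [wi + eta * vi for wi, vi in zip(w, v)]
        if E(w_new) < E(w):
            w, eta = w_new, alpha * eta      # accept and speed up
        else:
            eta = beta * eta                 # reject and slow down
    return w, eta

# Toy error surface (assumed): narrow in w[1], wide in w[0].
E = lambda w: w[0] ** 2 + 10.0 * w[1] ** 2
grad = lambda w: [2.0 * w[0], 20.0 * w[1]]
w_final, eta_final = variable_lr_gd(E, grad, [1.0, 1.0])
```

Because rejected steps leave the weights unchanged, the error never increases from one iteration to the next.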

After a little thought, one might wonder why we need a learning rate at 
all. Once the direction in which to move, v(t), has been determined, why not 
simply continue along that direction until the error stops decreasing? This 
leads us to steepest descent — gradient descent with line search. 


Steepest Descent. Gradient descent picks a descent direction $\mathbf{v}(t) = -\mathbf{g}(t)$
and updates the weights to $\mathbf{w}(t + 1) = \mathbf{w}(t) + \eta\mathbf{v}(t)$. Rather than pick $\eta$
arbitrarily, we will choose the optimal $\eta$ that minimizes $E_{\text{in}}(\mathbf{w}(t + 1))$. Once
you have the direction to move, make the best of it by moving along the line
$\mathbf{w}(t) + \eta\mathbf{v}(t)$ and stopping when $E_{\text{in}}$ is minimum (hence the term line search).
That is, choose a step size $\eta^*$, where

$$\eta^*(t) = \operatorname*{argmin}_{\eta}\; E_{\text{in}}(\mathbf{w}(t) + \eta\mathbf{v}(t)).$$


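Steepest Descent (Gradient Descent + Line Search):

1: Initialize $\mathbf{w}(0)$ and set $t = 0$;
2: while stopping criterion has not been met do
3:     Let $\mathbf{g}(t) = \nabla E_{\text{in}}(\mathbf{w}(t))$, and set $\mathbf{v}(t) = -\mathbf{g}(t)$.
4:     Let $\eta^* = \operatorname{argmin}_{\eta} E_{\text{in}}(\mathbf{w}(t) + \eta\mathbf{v}(t))$.
5:     $\mathbf{w}(t + 1) = \mathbf{w}(t) + \eta^*\mathbf{v}(t)$.
6:     Iterate to the next step, $t = t + 1$.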
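For a quadratic error the line search in step 4 has a closed form, which makes steepest descent easy to try out. The sketch below assumes $E(\mathbf{w}) = \frac{1}{2}\mathbf{w}^{\mathrm{T}}\mathrm{Q}\mathbf{w} - \mathbf{b}^{\mathrm{T}}\mathbf{w}$ with a known positive definite $\mathrm{Q}$, so that $\eta^* = \mathbf{g}^{\mathrm{T}}\mathbf{g} / \mathbf{g}^{\mathrm{T}}\mathrm{Q}\mathbf{g}$. This is an idealization for illustration; a neural network error needs a numerical line search instead.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mat_vec(Q, v):
    return [dot(row, v) for row in Q]

def steepest_descent_quadratic(Q, b, w0, num_iters=200):
    """Steepest descent on E(w) = 0.5 * w'Qw - b'w with exact line search."""
    w = list(w0)
    for _ in range(num_iters):
        g = [qw - bi for qw, bi in zip(mat_vec(Q, w), b)]   # gradient Qw - b
        gg = dot(g, g)
        if gg < 1e-30:                                      # gradient vanished
            break
        eta_star = gg / dot(g, mat_vec(Q, g))   # minimizes E along v = -g
        w = [wi - eta_star * gi for wi, gi in zip(w, g)]
    return w

# Toy problem (assumed): the minimizer is w* = [1, 1].
Q = [[2.0, 0.0], [0.0, 20.0]]
b = [2.0, 20.0]
w_sd = steepest_descent_quadratic(Q, b, [0.0, 0.0])
```

Even with the optimal step at every iteration, many iterations are needed on this elongated surface, because successive search directions are orthogonal and the path zig-zags.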





The line search in step 4 is a one dimensional optimization problem. Line 
search is an important step in most optimization algorithms, so an efficient 
algorithm is called for. Write $E(\eta)$ for $E_{\text{in}}(\mathbf{w}(t) + \eta\mathbf{v}(t))$. The goal is to find
a minimum of $E(\eta)$. We give a simple algorithm based on binary search.

Line Search. The idea is to find an interval on the line which is guaranteed
to contain a local minimum. Then, rapidly narrow the size of this interval
while maintaining as an invariant the fact that it contains a local minimum.
The basic invariant is a U-arrangement:

$$\eta_1 < \eta_2 < \eta_3 \quad \text{with} \quad E(\eta_2) < \min\{E(\eta_1), E(\eta_3)\}.$$

Since $E$ is continuous, there must be a local minimum in the interval
$[\eta_1, \eta_3]$. Now, consider the midpoint of the interval,

$$\bar{\eta} = \tfrac{1}{2}(\eta_1 + \eta_3),$$

hence the name bisection algorithm. Suppose that $\bar{\eta} < \eta_2$ as shown. If
$E(\bar{\eta}) < E(\eta_2)$ then $\{\eta_1, \bar{\eta}, \eta_2\}$ is a new, smaller U-arrangement;
and, if $E(\bar{\eta}) > E(\eta_2)$, then $\{\bar{\eta}, \eta_2, \eta_3\}$ is the new smaller
U-arrangement. In either case, the bisection process can be iterated with the
new U-arrangement. If $\bar{\eta}$ happens to equal $\eta_2$, perturb $\bar{\eta}$ slightly to
resolve the degeneracy. We leave it to the reader to determine how to obtain
the new smaller U-arrangement for the case $\bar{\eta} > \eta_2$.

[Figures: U-arrangements on the curve $E(\eta)$, with $\eta_1$, $\bar{\eta}$, $\eta_2$, $\eta_3$ marked.]

An efficient algorithm to find an initial U-arrangement is to start with
$\eta_1 = 0$ and $\eta_2 = \epsilon$ for some step $\epsilon$. If $E(\eta_2) < E(\eta_1)$, consider the sequence

$$\eta = 0, \epsilon, 2\epsilon, 4\epsilon, 8\epsilon, \ldots$$

(each time the step doubles). At some point, the error must increase. When
the error increases for the first time, the last three steps give a U-arrangement.
If, instead, $E(\eta_1) < E(\eta_2)$, consider the sequence

$$\eta = \epsilon, 0, -\epsilon, -2\epsilon, -4\epsilon, -8\epsilon, \ldots$$

(the step keeps doubling but in the reverse direction). Again, when the error
increases for the first time, the last three steps give a U-arrangement.¹²


Exercise 7.15 


Show that $|\eta_3 - \eta_1|$ decreases exponentially in the bisection algorithm.
[Hint: show that two iterations at least halve the interval size.]


The bisection algorithm continues to bisect the interval and update to a new
U-arrangement until the length of the interval $|\eta_3 - \eta_1|$ is small enough, at which
point you can return the midpoint of the interval as the approximate local
minimum. Usually 20 iterations of bisection are enough to get an acceptable
solution. A better quadratic interpolation algorithm is given in Problem 7.8,
which only needs about 4 iterations in practice.

12 We do not worry about $E(\eta_1) = E(\eta_2)$; such ties can be broken by small perturbations.
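The two stages of the line search, finding an initial U-arrangement by doubling and then narrowing it by bisection, can be sketched as follows. The function names, the tolerance, and the way the degenerate case is perturbed (moving the midpoint a quarter of the way toward $\eta_3$) are assumptions made for this illustration; ties in the error are not treated specially.

```python
def find_u_arrangement(E, eps=1e-3, max_doublings=60):
    """Return (eta1, eta2, eta3) with E(eta2) below both neighbours."""
    if E(eps) < E(0.0):
        a, b, c = 0.0, eps, 2.0 * eps     # sequence 0, eps, 2eps, 4eps, ...
    else:
        a, b, c = eps, 0.0, -eps          # sequence eps, 0, -eps, -2eps, ...
    for _ in range(max_doublings):
        if E(c) > E(b):                   # error increased for the first time
            return tuple(sorted((a, b, c)))
        a, b, c = b, c, 2.0 * c
    raise RuntimeError("no U-arrangement found")

def bisection_line_search(E, eta1, eta2, eta3, tol=1e-8, max_iters=1000):
    """Shrink a U-arrangement until the interval is shorter than tol."""
    for _ in range(max_iters):
        if eta3 - eta1 < tol:
            break
        mid = 0.5 * (eta1 + eta3)
        if mid == eta2:                   # degenerate case: nudge the midpoint
            mid = eta2 + 0.25 * (eta3 - eta2)
        if mid < eta2:
            if E(mid) < E(eta2):
                eta2, eta3 = mid, eta2    # new arrangement (eta1, mid, eta2)
            else:
                eta1 = mid                # new arrangement (mid, eta2, eta3)
        else:
            if E(mid) < E(eta2):
                eta1, eta2 = eta2, mid    # new arrangement (eta2, mid, eta3)
            else:
                eta3 = mid                # new arrangement (eta1, eta2, mid)
    return 0.5 * (eta1 + eta3)

# Toy one-dimensional error (assumed), minimized at eta = 3.3.
E_line = lambda eta: (eta - 3.3) ** 2
u = find_u_arrangement(E_line, eps=0.5)
eta_min = bisection_line_search(E_line, *u)
```

Each iteration costs one or two evaluations of $E$, so the number of error evaluations grows only logarithmically with the required precision.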


Example 7.3. We illustrate these three heuristics for improving gradient 
descent on our digit recognition (classifying ‘1’ versus other digits). We use 
200 data points and a neural network with 5 hidden units. We show the 
performance of gradient descent, gradient descent with variable learning rate, 
and steepest descent (line search) in Figure 7.4. The table below summarizes 
the in-sample error at various points in the optimization. 


In-sample error by optimization time:

Method                      | 10 sec | 1,000 sec | 50,000 sec
Gradient Descent            | 0.079  | 0.0206    | 0.00144
Stochastic Gradient Descent | 0.0213 | 0.00278   | 0.000022
Variable Learning Rate      | 0.039  | 0.014     | 0.00010
Steepest Descent            | 0.043  | 0.0189    | 0.000012

Figure 7.4: Gradient descent, variable learning rate and steepest descent
using digits data and a 5 hidden unit 2-layer neural network with linear
output. For variable learning rate, $\alpha = 1.1$ and $\beta = 0.8$. (Curves shown:
gradient descent, SGD, variable $\eta$, steepest descent; horizontal axis:
optimization time (sec), logarithmic scale.)


Note that SGD is quite competitive. The figure illustrates why it is hard to 
know when to stop minimizing. A flat region ‘trapped’ all the methods, even 
though we used a linear output node transform. It is very hard to differentiate 
between a flat region (which is typically caused by a very steep valley that leads 
to inefficient zig-zag behavior) and a true local minimum. 



















7.5.2 Conjugate Gradient Minimization 


Conjugate gradient is a queen among optimization methods because it leverages
a simple principle: don’t undo what you have already accomplished. When you
end a line search, because the error cannot be further decreased by moving back
or forth along the search direction, it must be that the new gradient and the
previous line search direction are orthogonal. What this means is that you have
succeeded in setting one of the components of the gradient to zero, namely the
component along the search direction $\mathbf{v}(t)$ (see the figure). If the next search
direction is the negative of the new gradient, it will be orthogonal to the
previous search direction.

You are at a local minimum when the gradient is zero, and setting one
component to zero is certainly a step in the right direction. As you move
along the next search direction (for example the new negative gradient), the
gradient will change and may not remain orthogonal to the previous search
direction, a task you laboriously accomplished in the previous line search.
The conjugate gradient algorithm chooses the next direction $\mathbf{v}(t + 1)$ so that
the gradient along this direction will remain perpendicular to the previous
search direction $\mathbf{v}(t)$. This is called the conjugate direction, hence the name.
After a line search along this new direction 
v(t+1) to minimize Fin, you will have set 
two components of the gradient to zero. 
First, the gradient remained perpendicu- 
lar to the previous search direction v(t). Fs 
Second, the gradient will be orthogonal to Jo) 
v(t+ 1) because of the line search (see the Ay 
figure). The gradient along the new direc- 
tion v(t + 1) is shown by the blue arrows 
in the figure. Because v(t + 1) is conju- 
gate to v(t), observe how the gradient as 


we move along v(t+1) remains orthogonal 
to the previous direction v(t). 
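To make the two orthogonality statements concrete, here is a short derivation under the idealized assumption that $E_{\text{in}}$ is exactly quadratic with a constant, symmetric Hessian $H$. The symbols $H$ and $\eta^*$ are notation assumed here for illustration.

$$
\begin{aligned}
\frac{d}{d\eta} E_{\text{in}}\bigl(w(t) + \eta\, v(t)\bigr)\Big|_{\eta=\eta^*}
  &= g(t+1)^{T} v(t) = 0
  && \text{(end of the line search along } v(t)\text{)},\\
g\bigl(w(t+1) + \eta\, v(t+1)\bigr)
  &= g(t+1) + \eta\, H\, v(t+1)
  && \text{(the gradient of a quadratic is linear in } w\text{)},\\
v(t)^{T} g\bigl(w(t+1) + \eta\, v(t+1)\bigr)
  &= \eta\, v(t)^{T} H\, v(t+1).
\end{aligned}
$$

So, in this idealized setting, the gradient stays orthogonal to v(t) for every step size η along the new direction exactly when $v(t)^{T} H\, v(t+1) = 0$, which is the usual definition of conjugate directions.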


[Figures: contours of constant $E_{\text{in}}$ in weight space for the two
line searches described above.]








Exercise 7.16 


Why does the new search direction pass through the optimal weights? 


We made progress! Now two components of the gradient are zero. In two
dimensions, this means that the gradient itself must be zero and we are done.
In higher dimensions, if we could continue to set a component of the gradient
to zero with each line search, maintaining all previous components at zero, we
would eventually set every component of the gradient to zero and be at a local
minimum. Our discussion is true for an idealized quadratic error function. In
general, conjugate gradient minimization implements our idealized expectations
approximately. Nevertheless, it works like a charm because the idealized
setting is a good approximation once you get close to a local minimum, and
this is where algorithms like gradient descent become ineffective.

Now for the details. The algorithm constructs the current search direction as
a linear combination of the previous search direction and the current
gradient,

$$ v(t) = -g(t) + \mu_t \, v(t-1), $$

where

$$ \mu_t = \frac{g(t+1)^{T}\bigl(g(t+1) - g(t)\bigr)}{g(t)^{T} g(t)}. $$

The term $\mu_t \, v(t-1)$ is called the momentum term because it asks you to
keep moving in the same direction you were moving in. The multiplier $\mu_t$
is called the momentum parameter. The full conjugate gradient descent
algorithm is summarized in the following algorithm box.


Conjugate Gradient Descent:
1: Initialize w(0) and set t = 0; set v(−1) = 0
2: while stopping criterion has not been met do
3:    Let $v(t) = -g(t) + \mu_t \, v(t-1)$, where
         $\mu_t = \dfrac{g(t+1)^{T}\bigl(g(t+1) - g(t)\bigr)}{g(t)^{T} g(t)}$.
4:    Let $\eta^* = \operatorname{argmin}_{\eta} E_{\text{in}}(w(t) + \eta\, v(t))$.
5:    $w(t+1) = w(t) + \eta^* v(t)$;
6:    Iterate to the next step, $t \leftarrow t + 1$;





The only difference between conjugate gradient descent and steepest descent is 
in step 3 where the line search direction is different from the negative gradient. 
Contrary to intuition, the negative gradient direction is not always the best 
direction to move, because it can undo some of the good work you did before. 

In practice, for error surfaces that are not exactly quadratic, the v(t)’s are
only approximately conjugate and it is recommended that you ‘restart’ the
algorithm by setting $\mu_t$ to zero every so often (for example every d
iterations). That is, every d iterations you throw in a steepest descent
iteration.
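The algorithm box and the restart heuristic can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used for the experiments in this chapter. It assumes that the momentum parameter is evaluated from the gradient at the current weights and the gradient from the previous iteration (the Polak-Ribière form), and the names `conjugate_gradient_descent`, `line_search`, `restart_every` and `tol` are hypothetical. The toy error surface is an exact quadratic, for which the optimal step size along a direction is available in closed form.

```python
import numpy as np

def conjugate_gradient_descent(grad, line_search, w0, max_iters=100,
                               restart_every=None, tol=1e-12):
    """Sketch of conjugate gradient descent.

    grad(w) returns the gradient at w; line_search(w, v) returns the step
    size that minimizes the error along w + eta * v.
    """
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)              # v(-1) = 0
    g_prev = None
    for t in range(max_iters):
        g = grad(w)
        if np.linalg.norm(g) < tol:   # assumed stopping criterion
            break
        restart = restart_every is not None and t % restart_every == 0
        if g_prev is None or restart:
            mu = 0.0                  # a plain steepest descent iteration
        else:
            # momentum parameter from current and previous gradients (assumed indexing)
            mu = g @ (g - g_prev) / (g_prev @ g_prev)
        v = -g + mu * v               # new search direction
        eta = line_search(w, v)       # line search along v
        w = w + eta * v
        g_prev = g
    return w

# Toy quadratic E(w) = 0.5 w^T H w - b^T w, whose gradient is H w - b.
H = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda w: H @ w - b
exact_line_search = lambda w, v: -(grad(w) @ v) / (v @ H @ v)

# In two dimensions, two line searches should reach the minimum.
w_cg = conjugate_gradient_descent(grad, exact_line_search, [0.0, 0.0], max_iters=2)
```

For a real neural network the line search would be numerical and the surface only approximately quadratic, which is where `restart_every` becomes useful.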


Example 7.4. Continuing with the digits example, we compare conjugate
gradient and the previous champion steepest descent in the next table and
Figure 7.5.

[Figure: curves for steepest descent and conjugate gradients plotted against
optimization time (sec), from 0.1 to 10^4.]

Figure 7.5: Steepest descent versus conjugate gradient descent using 200
examples of the digits data and a 2-layer sigmoidal neural network with 5
hidden units.


                      Optimization Time
Method              | 10 sec | 1,000 sec    | 50,000 sec
Steepest Descent    | 0.043  | 0.0189       | 1.2 × 10^-5
Conjugate Gradients | 0.0200 | 1.13 × 10^-? | 2.73 × 10^-?











The performance difference is dramatic. 





7.6 Deep Learning: Networks with Many Layers 


Universal approximation says that a single hidden layer with enough hidden 
units can approximate any target function. But, that may not be a natural 
way to represent the target function. Often, using many layers more closely mimics
human learning. Let’s get our feet wet with the digit recognition problem to 
classify ‘1’ versus ‘5’. A natural first step is to decompose the two digits into 
basic components, just as one might break down a face into two eyes, a nose, 
a mouth, two ears, etc. Here is one attempt for a prototypical ‘1’ and ‘5’. 


[Figure: a prototypical ‘1’ and ‘5’ decomposed into six basic shapes
$\phi_1, \phi_2, \phi_3, \phi_4, \phi_5, \phi_6$.]

Indeed, we could plausibly argue that every ‘1’ should contain a $\phi_1$,
$\phi_2$ and $\phi_3$; and, every ‘5’ should contain a $\phi_3$, $\phi_4$,
$\phi_5$, $\phi_6$ and perhaps a little $\phi_1$. We have deliberately used
the notation $\phi_i$ which we used earlier for the coordinates of the feature
transform $\Phi$. These basic shapes are features of the input, and, for
example, we would like $\phi_i$ to be large (close to 1) if its corresponding
feature is in the input image and small (close to −1) if not.


Exercise 7.17 


The basic shape $\phi_3$ is in both the ‘1’ and the ‘5’. What other digits do
you expect to contain each basic shape $\phi_1, \ldots, \phi_6$? How would you
select additional basic shapes if you wanted to distinguish between all the
digits? (What properties should useful basic shapes satisfy?)


We can build a classifier for ‘1’ versus ‘5’ from these basic shapes. Remember 
how, at the beginning of the chapter, we built a complex Boolean function 
from the ‘basic’ functions AND and OR? Let’s mimic that process here. The 
complex function we are building is the digit classifier and the basic functions 
are our features. Assume, for now, that we have feature functions $\phi_i$ which
compute the presence (+1) or absence (−1) of the corresponding feature. Take
a close look at the following network and work it through from input to output. 


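[Figure: network in which the features $\phi_1, \ldots, \phi_6$ feed two
nodes, $z_1$ (“is it a ‘1’?”) and $z_5$ (“is it a ‘5’?”), which feed the final
output node; each edge is marked as a +ve weight or a −ve weight.]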


Ignoring details like the exact values of the weights, node $z_1$ answers the
question “is the image a ‘1’?” and similarly node $z_5$ answers “is the image
a ‘5’?” Let’s see why. If they have done their job correctly when we feed in a
‘1’, $\phi_1, \phi_2, \phi_3$ compute +1, and $\phi_4, \phi_5, \phi_6$ compute
−1. Combining $\phi_1, \ldots, \phi_6$ with the signs of the weights on
outgoing edges, all the inputs to $z_1$ will be positive hence $z_1$ outputs
+1; all but one of the inputs into $z_5$ are negative, hence $z_5$ outputs −1.
A similar analysis holds if you feed in the ‘5’. The final node combines $z_1$
and $z_5$ to the final output. At this point, it is useful to fill in all the
blanks with an exercise.
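The sign argument above can be traced in a small sketch. The weight signs below are chosen to be consistent with the description in the text (all inputs to $z_1$ positive for a ‘1’, all but one input to $z_5$ negative); the unit magnitudes, the zero biases, the output weights and the names `W_HIDDEN`, `W_OUT` and `classify` are assumptions made only for illustration.

```python
import numpy as np

# Rows hold the weights from phi_1..phi_6 into z1 and z5.
# Only the signs matter here; magnitudes of 1 are an assumption.
W_HIDDEN = np.array([
    [+1, +1, +1, -1, -1, -1],   # z1: "is it a '1'?"
    [-1, -1, +1, +1, +1, +1],   # z5: "is it a '5'?"
])
W_OUT = np.array([+1, -1])      # assumed: output +1 means '1', -1 means '5'

def classify(phi):
    """Propagate feature values (+1 present, -1 absent) through the network."""
    z = np.sign(W_HIDDEN @ np.asarray(phi))   # outputs of z1 and z5
    return int(np.sign(W_OUT @ z))

phi_one = [+1, +1, +1, -1, -1, -1]    # idealized features of a '1'
phi_five = [-1, -1, +1, +1, +1, +1]   # idealized features of a '5'

label_one = classify(phi_one)
label_five = classify(phi_five)
```

With idealized features, `label_one` is +1 and `label_five` is −1; real images would produce softer feature values, which is what the exercise below explores.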


Exercise 7.18 


Since the input x is an image it is convenient to represent it as a matrix
$[x_{ij}]$ of its pixels which are black ($x_{ij} = 1$) or white
($x_{ij} = 0$). The basic shape $\phi_k$ identifies a set of these pixels
which are black.


(a) Show that feature $\phi_k$ can be computed by the neural network node

$$ \phi_k(x) = \tanh\Bigl(w_0 + \sum_{ij} w_{ij} x_{ij}\Bigr). $$

(b) What are the inputs to the neural network node?


(c) What do you choose as values for the weights? [Hint: consider separately
the weights of the pixels for those $x_{ij} \in \phi_k$ and those
$x_{ij} \notin \phi_k$.]


(d) How would you choose wo? (Not all digits are written identically, and 
so a basic shape may not always be exactly represented in the image.) 


(e) Draw the final network, filling in as many details as you can. 


You may have noticed that the output of $z_1$ is all we need to solve our problem.
This would not be the case if we were solving the full multi-class problem
with nodes $z_0, \ldots, z_9$ corresponding to all ten digits. Also, we solved our
problem with relative ease — our ‘deep’ network has just 2 hidden layers. In 
a more complex problem, like face recognition, the process would start just 
as we did here, with basic shapes. At the next level, we would constitute 
more complicated shapes from the basic shapes, but we would not be home 
yet. These more complicated shapes would constitute still more complicated 
shapes until at last we had realistic objects like eyes, a mouth, ears, etc. There 
would be a hierarchy of ‘basic’ features until we solve our problem at the very 
end. 

Now for the punch line and crux of our story. The punch line first. Shine 
your floodlights back on the network we constructed, and scrutinize what the 
different layers are doing. The first layer constructs a low-level representation 
of basic shapes; the next layer builds a higher level representation from these 
basic shapes. As we progress up more layers, we get more complex repre- 
sentations in terms of simpler parts from the previous layer: an ‘intelligent’ 
decomposition of the problem, starting from simple and getting more complex, 
until finally the problem is solved. This is the promise of the deep network, 
that it provides some human-like insight into how the problem is being solved 
based on a hierarchy of more complex representations for the input. While 
we might attain a solution of similar accuracy with a single hidden layer, we 
would gain no such insight. The picture is rosy for our intuitive digit recog- 
nition problem, but here is the crux of the matter: for a complex learning 
problem, how do we automate all of this in a computer algorithm?


Figure 7.6: Greedy deep learning algorithm. (a) First layer weights are
learned. (b) First layer is fixed and second layer weights are learned. (c)
First two layers are fixed and third layer weights are learned. (d) Learned
weights can be used as a starting point to fine-tune the entire network.


7.6.1 A Greedy Deep Learning Algorithm 


Historically, the shallow (single hidden layer) neural network was favored over 
the deep network because deep networks are hard to train, suffer from many 
local minima and, relative to the number of tunable parameters, they have a 
very large tendency to overfit (composition of nonlinearities is typically much 
more powerful than a linear combination of nonlinearities). Recently, some 
simple heuristics have shown good performance empirically and have brought 
deep networks back into the limelight. Indeed, the current best algorithm for 
digit recognition is a deep neural network trained with such heuristics. 


The greedy heuristic has a general form. Learn the first layer weights
$W^{(1)}$ and fix them.¹³ The output of the first hidden layer is a nonlinear
transformation of the inputs $x_n \to x_n^{(1)}$. These outputs $x_n^{(1)}$
are used to train the second layer weights $W^{(2)}$, while keeping the first
layer weights fixed. This is the essence of the greedy algorithm, to
‘greedily’ pick the first layer weights, fix them, and then move on to the
second layer weights. One ignores the possibility that better first layer
weights might exist if one takes into account what the second layer is doing.
The process continues with the outputs $x_n^{(2)}$ used to learn the weights
$W^{(3)}$, and so on.


¹³ Recall that we use the superscript $(\ell)$ to denote the layer $\ell$.


Greedy Deep Learning Algorithm:
1: for $\ell = 1, \ldots, L$ do
2:    $W^{(1)}, \ldots, W^{(\ell-1)}$ are given from previous iterations.
3:    Compute layer $\ell - 1$ outputs $x_n^{(\ell-1)}$ for $n = 1, \ldots, N$.
4:    Use $\{x_n^{(\ell-1)}\}$ to learn weights $W^{(\ell)}$ by training a single
      hidden layer neural network. ($W^{(1)}, \ldots, W^{(\ell-1)}$ are fixed.)

[Figure: the single hidden layer network trained in step 4, with its hidden
layer, output and error measure labeled.]



We have to clarify step 4 in the algorithm. The weights W and V are 
learned, though V is not needed in the algorithm. To learn the weights, we 
minimize an error (which will depend on the output of the network), and that 
error is not yet defined. To define the error, we must first define the output 
and then how to compute the error from the output. 


Unsupervised Auto-encoder. One approach is to take to heart the notion
that the hidden layer gives a high-level representation of the inputs. That is,
we should be able to reconstruct all the important aspects of the input from
the hidden layer output. A natural test is to reconstruct the input itself: the
output will be $\hat{x}_n$, a prediction of the input $x_n$; and, the error is
the difference between the two. For example, using squared error,

$$ e_n = \| \hat{x}_n - x_n \|^2. $$

When all is said and done, we obtain the weights without using the targets
$y_n$ and the hidden layer gives an encoding of the inputs, hence the name
unsupervised auto-encoder. This is reminiscent of the radial basis function
network in Chapter 6, where we used an unsupervised technique to learn the
centers of the basis functions, which provided a representative set of inputs
as the centers. Here, we go one step further and dissect the input-space itself
into pieces that are representative of the learning problem. At the end, the
targets have to be brought back into the picture (usually in the output layer).


Supervised Deep Network. The previous approach adheres to the philosophical
goal that the hidden layers provide an ‘intelligent’ hierarchical
representation of the inputs. A more direct approach is to train the two-layer
network on the targets. In this case the output is the predicted target
$\hat{y}_n$ and the error measure $e_n(y_n, \hat{y}_n)$ would be computed in
the usual way (for example squared error, cross entropy error, etc.).

In practice, there is no verdict on which method is better, with the
unsupervised auto-encoder camp being slightly more crowded than the supervised
camp. Try them both and see what works for your problem; that’s usually the
best way. Once you have your error measure, you just reach into your
optimization toolbox and minimize the error using your favorite method
(gradient descent, stochastic gradient descent, conjugate gradient descent,
...). A common tactic is to use the unsupervised auto-encoder first to set the
weights and then fine tune the whole network using supervised learning. The
idea is that the unsupervised pass gets you to the right local minimum of the
full network. But, no matter which camp you belong to, you still need to
choose the architecture of the deep network (number of hidden layers and their
sizes), and there is no magic potion for that. You will need to resort to old
tricks like validation, or a deep understanding of the problem (our hand made
network for the ‘1’ versus ‘5’ task suggests a deep network with six hidden
nodes in the first hidden layer and two in the second).
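The greedy procedure with the unsupervised auto-encoder can be sketched as follows. This is a minimal illustration under stated assumptions, not the code behind the experiments in this chapter: the hidden units use tanh, the output node is linear, the error is the squared reconstruction error averaged over the data, and plain gradient descent with a fixed learning rate is used. The names `train_two_layer`, `hidden_output`, `greedy_pretrain`, `lr` and `n_iters` are hypothetical.

```python
import numpy as np

def train_two_layer(X, Y, d_hidden, n_iters=500, lr=0.01, seed=0):
    """Gradient descent on a tanh-hidden, linear-output network; returns (W, V)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    X1 = np.hstack([np.ones((N, 1)), X])               # prepend the bias column
    W = 0.1 * rng.standard_normal((X1.shape[1], d_hidden))
    V = 0.1 * rng.standard_normal((d_hidden + 1, Y.shape[1]))
    for _ in range(n_iters):
        A = np.tanh(X1 @ W)                             # hidden layer outputs
        Z = np.hstack([np.ones((N, 1)), A])
        R = Z @ V - Y                                   # residuals
        grad_V = 2 * Z.T @ R / N
        grad_W = 2 * X1.T @ ((1 - A**2) * (R @ V[1:].T)) / N
        V -= lr * grad_V
        W -= lr * grad_W
    return W, V

def hidden_output(X, W):
    """Outputs of a trained hidden layer, used as inputs to the next layer."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.tanh(X1 @ W)

def greedy_pretrain(X, layer_sizes, **train_args):
    """Learn one layer at a time, keeping earlier layers fixed."""
    weights, inputs = [], X
    for d_hidden in layer_sizes:
        # Auto-encoder: the target is the layer's own input; V is discarded.
        W, _ = train_two_layer(inputs, inputs, d_hidden, **train_args)
        weights.append(W)
        inputs = hidden_output(inputs, W)
    return weights

# Toy usage on random data with 8 inputs and hidden layers of sizes 4 and 2.
X_demo = np.random.default_rng(1).standard_normal((20, 8))
pretrained = greedy_pretrain(X_demo, [4, 2], n_iters=50)
```

For the supervised variant, one would pass the targets instead of `inputs` as the second argument of `train_two_layer`; the resulting weights could then serve as the starting point for fine-tuning the whole network.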


Exercise 7.19 


Previously, for our digit problem, we used symmetry and intensity. How do 
these features relate to deep networks? Do we still need them? 


Example 7.5. Deep Learning For Digit Recognition. Let’s revisit the
digits classification problem ‘1’ versus ‘5’ using a deep network architecture

$$ [d^{(0)}, d^{(1)}, d^{(2)}, d^{(3)}] = [256, 6, 2, 1]. $$

(The same architecture we constructed by hand earlier, with 16 × 16 input
pixels and 1 output.) We will use gradient descent to train the two layer
networks in the greedy algorithm. A convenient matrix form for the gradient
of the two layer network is given in Problem 7.7. For the unsupervised
auto-encoder the target output is the input matrix X. For the supervised deep
network, the target output is just the target vector y. We used the supervised
approach with 1,000,000 gradient descent iterations for each supervised greedy
step using a sample of 1500 examples from the digits data. Here is a look at
what the 6 hidden units in the first hidden layer learned. For each hidden
node in the first hidden layer, we show the pixels corresponding to the top 20
incoming weights.


























[Figure: for each of the 6 hidden units in the first hidden layer, the pixels
corresponding to its top 20 incoming weights.]

Real data is not as clean as our idealized analysis. Don’t be surprised.
Nevertheless, we can discern that $\phi_2$ has picked out the pixels (shapes)
in the typical ‘1’ that are unlikely to be in a typical ‘5’. The other
features seem to focus on the ‘5’ and to some extent match our hand
constructed features. Let’s not dwell on whether the representation captures
human intuition; it does to some extent. The important thing is that this
result is automatic and purely data driven (other than our choice of the
network architecture); and, what matters is out-of-sample performance. For
different architectures, we ran more than 1000 validation experiments
selecting 500 random training points each time and the remaining data as a
test set.


Deep Network Architecture | $E_{\text{in}}$ | $E_{\text{test}}$
[256, 3, 2, 1]            | 0               | 0.170%
[256, 6, 2, 1]            | 0               | 0.187%
[256, 12, 2, 1]           | 0               | 0.187%
[256, 24, 2, 1]           | 0               | 0.183%


$E_{\text{in}}$ is always zero because there are so many parameters, even with
just 3 hidden units in the first hidden layer. This smells of overfitting.
But, the test performance is impressive at 99.8% accuracy, which is all we
care about. Our hand constructed features of symmetry and intensity were good,
but not quite this good.


7.7 Problems


Problem 7.1 Implement the decision function below using a 3-layer 
perceptron. 
































Problem 7.2  A set of M hyperplanes will generally divide the space
into some number of regions. Every point in $\mathbb{R}^d$ can be labeled with
an M dimensional vector that determines which side of each plane it is on.
Thus, for example if M = 3, then a point with a vector (−1, +1, +1) is on the
−1 side of the first hyperplane, and on the +1 side of the second and third
hyperplanes. A region is defined as the set of points with the same label.


(a) Prove that the regions with the same label are convex.
(b) Prove that M hyperplanes can create at most $2^M$ distinct regions.


(c) [hard] What is the maximum number of regions created by M hyperplanes
in d dimensions?

[Answer: $\sum_{i=0}^{d} \binom{M}{i}$.]

[Hint: Use induction and let B(M, d) be the number of regions created
by M (d − 1)-dimensional hyperplanes in d-space. Now consider adding
the (M + 1)th hyperplane. Show that this hyperplane intersects at most
B(M, d − 1) of the B(M, d) regions. For each region it intersects, it
adds exactly one region, and so B(M + 1, d) ≤ B(M, d) + B(M, d − 1).
(Is this recurrence familiar?) Evaluate the boundary conditions: B(M, 1)
and B(1, d), and proceed from there. To see that the (M + 1)th hyperplane
only intersects B(M, d − 1) regions, argue as follows. Treat the (M + 1)th
hyperplane as a (d − 1)-dimensional space, and project the initial M
hyperplanes into this space to get M hyperplanes in a (d − 1)-dimensional
space. These M hyperplanes can create at most B(M, d − 1) regions
in this space. Argue that this means that the (M + 1)th hyperplane is
only intersecting at most B(M, d − 1) of the regions created by the M
hyperplanes in d-space.]
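As a quick numerical companion to part (c), the stated answer can be evaluated directly and compared with the $2^M$ bound of part (b). The function name `max_regions` is hypothetical; the formula is the one given in the answer above.

```python
from math import comb

def max_regions(M, d):
    """Evaluate sum_{i=0}^{d} C(M, i), the stated bound on the number of regions."""
    return sum(comb(M, i) for i in range(d + 1))

# Three lines in the plane versus the 2^M bound from part (b).
regions_3_lines = max_regions(3, 2)
bound_part_b = 2 ** 3
```

Note that `math.comb(M, i)` is zero when i exceeds M, so for d ≥ M the sum collapses to $2^M$ and the two bounds agree.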
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Problem 7.3 Suppose that a target function f (for classification) is 
represented by a number of hyperplanes, where the different regions defined by
the hyperplanes (see Problem 7.2) could be either classified +1 or −1, as with
the 2-dimensional examples we considered in the text. Let the hyperplanes be
$h_1, h_2, \ldots, h_M$, where $h_m(x) = \operatorname{sign}(w_m^{T} x)$.
Consider all the regions that are classified +1, and let one such region be
$r^{+}$. Let $c = (c_1, c_2, \ldots, c_M)$ be the label of any point in the
region (all points in a given region have the same label); the label
$c_m = \pm 1$ tells which side of $h_m$ the point is on. Define the AND-term
corresponding to region r by

$$ t_r = h_1^{c_1} h_2^{c_2} \cdots h_M^{c_M},
   \qquad \text{where } h_m^{c_m} =
   \begin{cases} h_m & \text{if } c_m = +1, \\ \bar{h}_m & \text{if } c_m = -1. \end{cases} $$

Show that $f = t_{r_1} + t_{r_2} + \cdots + t_{r_k}$, where $r_1, \ldots, r_k$
are all the positive regions. (We use multiplication for the AND and addition
for the OR operators.)


Problem 7.4  Referring to Problem 7.3, any target function which can
be decomposed into hyperplanes $h_1, \ldots, h_M$ can be represented by
$f = t_{r_1} + t_{r_2} + \cdots + t_{r_k}$, where there are k positive regions.


What is the structure of the 3-layer perceptron (number of hidden units in each 
layer) that will implement this function, proving the following theorem: 





Theorem. Any decision function whose +1 regions are defined in terms of 
the regions created by a set of hyperplanes can be implemented by a 3-layer 
perceptron. 


Problem 7.5 [Hard]  State and prove a version of a Universal Approximation
Theorem:


Theorem. Any target function f (for classification) defined on $[0, 1]^d$,
whose classification boundary surfaces are smooth, can arbitrarily closely be
approximated by a 3-layer perceptron.


[Hint: Decompose the unit hypercube into ε-hypercubes ($1/\epsilon^d$ of
them). The volume of these ε-hypercubes which intersect the classification
boundaries must tend to zero (why? use smoothness). Thus, the function which
takes on the value of f on any ε-hypercube that does not intersect the
boundary and an arbitrary value on these boundary ε-hypercubes will
approximate f arbitrarily closely, as ε → 0.]


Problem 7.6  The finite difference approximation to obtaining the gradient
is based on the following formula from calculus:

$$ \frac{\partial h}{\partial w_{ij}^{(\ell)}}
   = \frac{h(w_{ij}^{(\ell)} + \epsilon) - h(w_{ij}^{(\ell)} - \epsilon)}{2\epsilon}
   + O(\epsilon^2), $$

where $h(w_{ij}^{(\ell)} + \epsilon)$ denotes the function value when all
weights are held at their values in w except for the weight
$w_{ij}^{(\ell)}$, which is perturbed by ε. To get the gradient, we need the
partial derivative with respect to each weight.


Show that the computational complexity of obtaining all these partial
derivatives is $O(W^2)$. [Hint: you have to do two forward propagations for
each weight.]
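The formula above translates into a short routine that perturbs one weight at a time. This is a sketch for a generic function of a weight vector; the names `finite_difference_gradient` and `eps`, and the toy function used to exercise it, are assumptions for illustration. Each weight costs two function evaluations, which is the source of the cost discussed in the hint.

```python
import numpy as np

def finite_difference_gradient(h, w, eps=1e-5):
    """Central-difference estimate of the gradient of h at w."""
    w = np.asarray(w, dtype=float)
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step.flat[i] = eps                      # perturb only weight i
        grad.flat[i] = (h(w + step) - h(w - step)) / (2 * eps)
    return grad

# Toy function whose exact gradient at (1, 2, 3) is (4, 5, 6).
h_toy = lambda w: np.sum(w**2) + w[0] * w[1]
g_fd = finite_difference_gradient(h_toy, [1.0, 2.0, 3.0])
```

Such a routine is mainly useful as a check on an analytic gradient, since backpropagation obtains all the partial derivatives far more cheaply.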


Problem 7.7  Consider the 2-layer network below, with output vector
$\hat{y}$. This is the two layer network used for the greedy deep network
algorithm.

[Figure: two-layer network $x_n \to z_n \to \hat{y}_n$ with first layer
weights W, second layer weights V, and error
$e_n = \| y_n - \hat{y}_n \|^2$.]

Collect the input vectors $x_n$ (together with a column of ones) as rows in
the input data matrix X, and similarly form Z from $z_n$. The target matrices
Y and $\hat{Y}$ are formed from $y_n$ and $\hat{y}_n$ respectively. Assume a
linear output node and the hidden layer activation is θ(·).


(a) Show that the in-sample error is

$$ E_{\text{in}} = \operatorname{trace}\bigl( (Y - \hat{Y})(Y - \hat{Y})^{T} \bigr), $$

where

$$ \hat{Y} = ZV, \qquad Z = [\mathbf{1}, \theta(XW)], \qquad
   V = \begin{bmatrix} V_0 \\ V_1 \end{bmatrix}, $$

and

    X is N × (d + 1)
    W is (d + 1) × $d^{(1)}$
    Z is N × ($d^{(1)}$ + 1)
    V is ($d^{(1)}$ + 1) × dim(y)
    Y, $\hat{Y}$ are N × dim(y)

(It is convenient to decompose V into its first row $V_0$ corresponding to
the biases and its remaining rows $V_1$; $\mathbf{1}$ is the N × 1 vector of
ones.)


(b) Derive the gradient matrices:

$$ \frac{\partial E_{\text{in}}}{\partial V} = 2 Z^{T} Z V - 2 Z^{T} Y, $$

$$ \frac{\partial E_{\text{in}}}{\partial W}
   = 2 X^{T} \Bigl[ \theta'(XW) \otimes
     \bigl( \theta(XW) V_1 V_1^{T} + \mathbf{1} V_0 V_1^{T} - Y V_1^{T} \bigr) \Bigr], $$

where ⊗ denotes element-wise multiplication. Some of the matrix derivatives
of functions involving the trace from the appendix may be useful.
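The gradient matrices in part (b) can be written down almost verbatim in NumPy. The sketch below assumes θ = tanh (so θ′ = 1 − θ²) and that X already contains the leading column of ones; the function names `forward`, `e_in` and `gradients` are hypothetical. It does not replace the derivation, but comparing it against finite differences is a useful sanity check.

```python
import numpy as np

def forward(X, W, V):
    """Return theta(XW), Z = [1, theta(XW)] and the linear output Z V."""
    A = np.tanh(X @ W)                                   # theta(XW), assumed tanh
    Z = np.hstack([np.ones((X.shape[0], 1)), A])
    return A, Z, Z @ V

def e_in(X, Y, W, V):
    """In-sample error trace((Y - Yhat)(Y - Yhat)^T)."""
    _, _, Y_hat = forward(X, W, V)
    D = Y - Y_hat
    return np.trace(D @ D.T)

def gradients(X, Y, W, V):
    """Gradient matrices of e_in with respect to W and V, as in part (b)."""
    A, Z, _ = forward(X, W, V)
    V0, V1 = V[:1], V[1:]                                # bias row and remaining rows
    ones = np.ones((X.shape[0], 1))
    grad_V = 2 * Z.T @ Z @ V - 2 * Z.T @ Y
    inner = A @ V1 @ V1.T + ones @ V0 @ V1.T - Y @ V1.T
    grad_W = 2 * X.T @ ((1 - A**2) * inner)              # theta'(XW) = 1 - tanh^2
    return grad_W, grad_V

# Small random instance: N = 5, d = 3, d1 = 3 hidden units, dim(y) = 2.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.standard_normal((5, 3))])
Y = rng.standard_normal((5, 2))
W = rng.standard_normal((4, 3))
V = rng.standard_normal((4, 2))
grad_W, grad_V = gradients(X, Y, W, V)
```

The element-wise product in `grad_W` is the ⊗ of the formula; everything else is ordinary matrix multiplication.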
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Problem 7.8 Quadratic Interpolation for Line Search 


Assume that a U-arrangement has been found, as illustrated below.

[Figure: $E(\eta)$ and the quadratic interpolant, with the points $\eta_1$,
$\bar{\eta}$, $\eta_2$, $\eta_3$ marked on the η axis.]

Instead of using bisection to construct the point $\bar{\eta}$, quadratic
interpolation fits a quadratic curve $E(\eta) = a\eta^2 + b\eta + c$ to the
three points and uses the minimum of this quadratic interpolation as
$\bar{\eta}$.


(a) Show that the minimum of the quadratic interpolant for a U-arrangement
is within the interval $[\eta_1, \eta_3]$.


(b) Let $e_1 = E(\eta_1)$, $e_2 = E(\eta_2)$, $e_3 = E(\eta_3)$. Obtain the
quadratic function that interpolates the three points
$\{(\eta_1, e_1), (\eta_2, e_2), (\eta_3, e_3)\}$. Show that the minimum of
this quadratic interpolant is given by:

$$ \bar{\eta} = \frac{1}{2} \left[
   \frac{(e_1 - e_2)(\eta_1^2 - \eta_3^2) - (e_1 - e_3)(\eta_1^2 - \eta_2^2)}
        {(e_1 - e_2)(\eta_1 - \eta_3) - (e_1 - e_3)(\eta_1 - \eta_2)} \right] $$

[Hint: $e_1 = a\eta_1^2 + b\eta_1 + c$, $e_2 = a\eta_2^2 + b\eta_2 + c$,
$e_3 = a\eta_3^2 + b\eta_3 + c$. Solve for a, b, c and the minimum of the
quadratic is given by $\bar{\eta} = -b/2a$.]


(c) Depending on whether $E(\bar{\eta})$ is less than $E(\eta_2)$, and on
whether $\bar{\eta}$ is to the left or right of $\eta_2$, there are 4 cases.
In each case, what is the smaller U-arrangement?


(d) What if $\bar{\eta} = \eta_2$, a degenerate case?


Note: in general the quadratic interpolations converge very rapidly to a
locally optimal η. In practice, 4 iterations are more than sufficient.
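The formula in part (b) is easy to exercise numerically. The sketch below only evaluates the interpolation step once; the name `quadratic_interpolation_min` and the toy error curve are assumptions for illustration, and the function does not handle the degenerate case in which the denominator vanishes.

```python
def quadratic_interpolation_min(eta1, eta2, eta3, e1, e2, e3):
    """Minimizer of the quadratic through (eta1, e1), (eta2, e2), (eta3, e3)."""
    num = (e1 - e2) * (eta1**2 - eta3**2) - (e1 - e3) * (eta1**2 - eta2**2)
    den = (e1 - e2) * (eta1 - eta3) - (e1 - e3) * (eta1 - eta2)
    return 0.5 * num / den

# Toy curve E(eta) = (eta - 2)^2 + 1 sampled at a U-arrangement 0 < 1 < 5.
E_toy = lambda eta: (eta - 2.0) ** 2 + 1.0
eta_bar = quadratic_interpolation_min(0.0, 1.0, 5.0, E_toy(0.0), E_toy(1.0), E_toy(5.0))
```

Because the toy curve is itself a quadratic, a single interpolation step lands on its minimizer; for a general error curve the step would be repeated on the smaller U-arrangement from part (c).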


Problem 7.9 [Convergence of Monte-Carlo Minimization]
Suppose the global minimum w* is in the unit cube and the error surface is
quadratic near w*. So, near w*,

$$ E(w) = E(w^*) + \tfrac{1}{2} (w - w^*)^{T} H (w - w^*), $$

where the Hessian H is positive definite and symmetric.

(a) If you uniformly sample w in the unit cube, show that

$$ P[E \le E(w^*) + \epsilon]
   = \int_{x^{T} H x \le 2\epsilon} d^d x
   = \frac{S_d(\sqrt{2\epsilon})}{\sqrt{\det H}}, $$

where $S_d(r)$ is the volume of the d-dimensional sphere of radius r,
$S_d(r) = \pi^{d/2} r^d / \Gamma(\tfrac{d}{2} + 1)$.

[Hints: $P[E \le E(w^*) + \epsilon] = P[\tfrac{1}{2}(w - w^*)^{T} H (w - w^*) \le \epsilon]$.
Suppose the orthogonal matrix A diagonalizes H:
$A^{T} H A = \operatorname{diag}[\lambda_1^2, \ldots, \lambda_d^2]$.
Change variables to $u = A^{T} x$ and use
$\det H = \lambda_1^2 \lambda_2^2 \cdots \lambda_d^2$.]


(b) Suppose you sample M times and choose the weights with minimum 
error, Wmin. Show that 


P[B(Wmin) > Bw") +d © (: -E (3) ) y 


where u œ y/8er/A and À is the geometric mean of the eigenvalues of H. 
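(You may use $\Gamma(x + 1) \approx x^x e^{-x} \sqrt{2\pi x}$.)

(c) Show that if $N \sim \left(\frac{\sqrt{d}}{\mu}\right)^{d} \log\frac{1}{\eta}$,
then with probability at least $1 - \eta$,
$E(w_{\min}) \le E(w^*) + \epsilon$.
(You may use $\log(1 - a) \approx -a$ for small a and $(\pi d)^{1/d} \approx 1$.)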


Problem 7.10  For a neural network with at least 1 hidden layer and
tanh(·) transformations in each non-input node, what is the gradient (with
respect to the weights) if all the weights are set to zero?


Is it a good idea to initialize the weights to zero? 


Problem 7.11 [Optimal Learning Rate]  Suppose that we are in the
vicinity of a local minimum, w*, of the error surface, or that the error
surface is quadratic. The expression for the error function is then given by

$$ E(w_t) = E(w^*) + \tfrac{1}{2} (w_t - w^*)^{T} H (w_t - w^*) \tag{7.8} $$

from which it is easy to see that the gradient is given by
$g_t = H(w_t - w^*)$. The weight updates are then given by
$w_{t+1} = w_t - \eta H (w_t - w^*)$, and subtracting w* from both sides and
writing $\epsilon_t = w_t - w^*$, we see that

$$ \epsilon_{t+1} = (I - \eta H)\, \epsilon_t \tag{7.9} $$

Since H is symmetric, one can form an orthonormal basis with its eigenvectors.
Projecting $\epsilon_t$ and $\epsilon_{t+1}$ onto this basis, we see that in
this basis, each component decouples from the others, and letting
$\epsilon(\alpha)$ be the α-th component in this basis, we see that

$$ \epsilon_{t+1}(\alpha) = (1 - \eta \lambda_\alpha)\, \epsilon_t(\alpha) \tag{7.10} $$

so we see that each component exhibits linear convergence with its own
coefficient of convergence $k_\alpha = 1 - \eta \lambda_\alpha$. The worst
component will dominate the convergence so we are interested in choosing η so
that the $k_\alpha$ with largest magnitude is minimized. Since H is positive
definite, all the $\lambda_\alpha$’s are positive, so it is easy to see that
one should choose η so that the two extreme coefficients have equal magnitude
and opposite sign, that is, $\eta\lambda_{\min} = 1 - \Delta$ and
$\eta\lambda_{\max} = 1 + \Delta$. Solving for the optimal η, one finds that

$$ \eta_{\text{opt}} = \frac{2}{\lambda_{\min} + \lambda_{\max}}, \qquad
   k_{\text{opt}} = \frac{1 - c}{1 + c}, \tag{7.11} $$

where $c = \lambda_{\min} / \lambda_{\max}$ is the condition number of H, and
is an important measure of the stability of H. When c ≈ 0, one usually says
that H is ill-conditioned. Among other things, this affects one’s ability to
numerically compute the inverse of H.
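A small numerical check of (7.11) may help build intuition. The sketch below uses a hypothetical diagonal Hessian with eigenvalues 1, 4 and 10; the function names are assumptions for illustration. It evaluates the worst coefficient of convergence, $\max_\alpha |1 - \eta\lambda_\alpha|$, at the optimal learning rate and at learning rates slightly smaller and larger.

```python
import numpy as np

def optimal_learning_rate(H):
    """eta_opt = 2 / (lambda_min + lambda_max) for a symmetric positive definite H."""
    lam = np.linalg.eigvalsh(H)          # eigenvalues in ascending order
    return 2.0 / (lam[0] + lam[-1])

def worst_coefficient(H, eta):
    """Largest magnitude of 1 - eta * lambda over the eigenvalues of H."""
    lam = np.linalg.eigvalsh(H)
    return np.max(np.abs(1.0 - eta * lam))

H_toy = np.diag([1.0, 4.0, 10.0])        # hypothetical Hessian
eta_opt = optimal_learning_rate(H_toy)
c = 1.0 / 10.0                           # lambda_min / lambda_max

k_at_opt = worst_coefficient(H_toy, eta_opt)
k_smaller = worst_coefficient(H_toy, 0.9 * eta_opt)
k_larger = worst_coefficient(H_toy, 1.1 * eta_opt)
```

Here `k_at_opt` matches (1 − c)/(1 + c), while both `k_smaller` and `k_larger` are worse, which is the balancing argument made in the problem.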


Problem 7.12 [Hard]  With a variable learning rate, suppose that
$\eta_t \to 0$ satisfying $\sum_t \eta_t = \infty$ and
$\sum_t \eta_t^2 < \infty$, for example one could choose
$\eta_t = 1/(t + 1)$. Show that gradient descent will converge to a local
minimum.


Problem 7.13 [Finite Difference Approximation to Hessian] 


(a) Consider the function $E(w_1, w_2)$. Show that the finite difference
approximation to the second order partial derivatives are given by

$$ \frac{\partial^2 E}{\partial w_1^2} \approx
   \frac{E(w_1 + 2h, w_2) + E(w_1 - 2h, w_2) - 2E(w_1, w_2)}{4h^2} $$

$$ \frac{\partial^2 E}{\partial w_2^2} \approx
   \frac{E(w_1, w_2 + 2h) + E(w_1, w_2 - 2h) - 2E(w_1, w_2)}{4h^2} $$

$$ \frac{\partial^2 E}{\partial w_1 \partial w_2} \approx
   \frac{E(w_1 + h, w_2 + h) + E(w_1 - h, w_2 - h)
         - E(w_1 + h, w_2 - h) - E(w_1 - h, w_2 + h)}{4h^2} $$


(b) Give an algorithm to compute the finite difference approximation to the
Hessian matrix for $E_{\text{in}}(w)$, the in-sample error for a multilayer
neural network with weights $w = [W^{(1)}, \ldots, W^{(L)}]$.


(c) Compute the asymptotic running time of your algorithm in terms of the
number of weights in your network and the number of data points.
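The three formulas in part (a) extend coordinate by coordinate to any number of weights. The following is one possible sketch, assuming the weights have been flattened into a single vector and that the error is available as a function `E` of that vector; the names `finite_difference_hessian` and `h`, and the toy function, are hypothetical.

```python
import numpy as np

def finite_difference_hessian(E, w, h=1e-3):
    """Finite-difference Hessian of E at the weight vector w."""
    w = np.asarray(w, dtype=float)
    n = w.size
    I = np.eye(n)                         # rows are the coordinate directions
    hess = np.zeros((n, n))
    E0 = E(w)
    for i in range(n):
        # Diagonal entry: second difference with step 2h.
        hess[i, i] = (E(w + 2*h*I[i]) + E(w - 2*h*I[i]) - 2*E0) / (4*h**2)
        for j in range(i + 1, n):
            # Mixed partial: four evaluations at the corners of a square.
            mixed = (E(w + h*I[i] + h*I[j]) + E(w - h*I[i] - h*I[j])
                     - E(w + h*I[i] - h*I[j]) - E(w - h*I[i] + h*I[j])) / (4*h**2)
            hess[i, j] = hess[j, i] = mixed
    return hess

# Toy function with exact Hessian [[2, 3], [3, 4]].
E_toy = lambda w: w[0]**2 + 3*w[0]*w[1] + 2*w[1]**2
hess_toy = finite_difference_hessian(E_toy, [0.5, -1.0])
```

Counting the calls to `E` in the two loops is a good starting point for part (c).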


Problem 7.14  Suppose we take a step of fixed size in some direction. We
ask what the optimal direction for this fixed step is, assuming that the
quadratic model for the error surface is accurate:

$$ E_{\text{in}}(w_t + \Delta w) = E_{\text{in}}(w_t) + g_t^{T} \Delta w
   + \tfrac{1}{2} \Delta w^{T} H_t \Delta w. $$

So we want to minimize $E_{\text{in}}(\Delta w)$ with respect to $\Delta w$
subject to the constraint that the step size is η, i.e., that
$\Delta w^{T} \Delta w = \eta^2$.


(a) Show that the Lagrangian for this constrained minimization problem is:

$$ \mathcal{L} = E_{\text{in}}(w_t) + g_t^{T} \Delta w
   + \tfrac{1}{2} \Delta w^{T} (H_t + 2\alpha I) \Delta w, \tag{7.12} $$

where α is the Lagrange multiplier.


(b) Solve for $\Delta w$ and α and show that they satisfy the two equations:

$$ \Delta w = -(H_t + 2\alpha I)^{-1} g_t, $$
$$ \Delta w^{T} \Delta w = \eta^2. $$


(c) Show that α satisfies the implicit equation:

$$ \alpha = -\frac{1}{2\eta^2} \bigl( \Delta w^{T} g_t + \Delta w^{T} H_t \Delta w \bigr). $$

Argue that the second term is Θ(1) and the first is $O(\sim \|g_t\| / \eta)$.
So, α is large for a small step size η.


(d) Assume that α is large. Show that, to leading order in $1/\eta$,

$$ \alpha = \frac{\|g_t\|}{2\eta}. $$

Therefore α is large, consistent with expanding $\Delta w$ to leading order in
$1/\alpha$. [Hint: expand $\Delta w$ to leading order in $1/\alpha$.]


(e) Using (d), show that
$\Delta w = -\left( H_t + \frac{\|g_t\|}{\eta} I \right)^{-1} g_t$.


Problem 7.15  The outer-product Hessian approximation is
$H = \sum_{n=1}^{N} g_n g_n^{T}$. Let $H_k = \sum_{n=1}^{k} g_n g_n^{T}$ be
the partial sum to k, and let $H_k^{-1}$ be its inverse.


(a) Show that

$$ H_{k+1}^{-1} = H_k^{-1}
   - \frac{H_k^{-1} g_{k+1} g_{k+1}^{T} H_k^{-1}}{1 + g_{k+1}^{T} H_k^{-1} g_{k+1}}. $$

[Hints: $H_{k+1} = H_k + g_{k+1} g_{k+1}^{T}$; and,
$(A + z z^{T})^{-1} = A^{-1} - \dfrac{A^{-1} z z^{T} A^{-1}}{1 + z^{T} A^{-1} z}$.]


(b) Use part (a) to give an $O(N W^2)$ algorithm to compute $H_N^{-1}$, the
same time it takes to compute H. (W is the number of dimensions in g.)
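The update in part (a) can be applied one gradient at a time. The sketch below assumes the gradients $g_n$ are stacked as the rows of an array `G`, and it starts from the inverse of a small multiple of the identity, $H_0 = \epsilon I$, so that the first inverse exists; the names `inverse_outer_product_hessian`, `G` and `eps` are hypothetical.

```python
import numpy as np

def inverse_outer_product_hessian(G, eps=1e-6):
    """Inverse of eps*I + sum_n g_n g_n^T via repeated rank-one updates."""
    n_weights = G.shape[1]
    H_inv = np.eye(n_weights) / eps          # inverse of H_0 = eps * I
    for g in G:
        Hg = H_inv @ g                       # H_k^{-1} g_{k+1}; H_inv is symmetric
        H_inv = H_inv - np.outer(Hg, Hg) / (1.0 + g @ Hg)
    return H_inv

# Toy check against a direct inverse on random gradients.
rng = np.random.default_rng(0)
G_toy = rng.standard_normal((10, 3))
H_inv_toy = inverse_outer_product_hessian(G_toy, eps=1e-3)
H_direct = np.linalg.inv(G_toy.T @ G_toy + 1e-3 * np.eye(3))
```

Each update costs $O(W^2)$ operations because it only involves a matrix-vector product and an outer product, never a fresh matrix inversion.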


Note: typically, this algorithm is initialized with $H_0 = \epsilon I$ for
some small ε. So the algorithm actually computes $(H + \epsilon I)^{-1}$; the
results are not very sensitive to the choice of ε, as long as ε is small.


Problem 7.16  In the text, we computed an upper bound on the VC-dimension
of the 2-layer perceptron, $d_{\text{vc}} = O(md \log(md))$, where m is the
number of hidden units in the hidden layer. Prove that this bound is
essentially tight by showing that $d_{\text{vc}} = \Omega(md)$. To do this,
show that it is possible to find md points that can be shattered when m is
even as follows.


Consider any set of N points $x_1, \ldots, x_N$ in general position with
N = md. N points in d dimensions are in general position if no subset of
d + 1 points lies on a d − 1 dimensional hyperplane. Now, consider any
dichotomy on these points with r of the points classified +1. Without loss of
generality, relabel the points so that $x_1, \ldots, x_r$ are +1.


(a) Show that without loss of generality, you can assume that r ≤ N/2. For
the rest of the problem you may therefore assume that r ≤ N/2.


(b) Partition these r positive points into groups of size d. The last group
may have fewer than d points. Show that the number of groups is at
most N/2. Label these groups $D_i$ for i = 1, ..., q ≤ N/2.


(c) Show that for any subset of k points with k ≤ d, there is a hyperplane
containing those points and no others.


(d) By the previous part, let $w_i, b_i$ be the hyperplane through the points
in group $D_i$, and containing no others. So

$$ w_i^{T} x_n + b_i = 0 $$

if and only if $x_n \in D_i$. Show that it is possible to find h small enough
so that for $x_n \in D_i$,

$$ | w_i^{T} x_n + b_i | < h, $$

and for $x_n \notin D_i$

$$ | w_i^{T} x_n + b_i | > h. $$


(e) Show that for $x_n \in D_i$,

$$ \operatorname{sign}(w_i^{T} x_n + b_i + h) + \operatorname{sign}(-w_i^{T} x_n - b_i + h) = 2, $$

and for $x_n \notin D_i$

$$ \operatorname{sign}(w_i^{T} x_n + b_i + h) + \operatorname{sign}(-w_i^{T} x_n - b_i + h) = 0. $$


(f) Use the results so far to construct a 2-layer MLP with 2r hidden units
which implements the dichotomy (which was arbitrary). Complete the
argument to show that $d_{\text{vc}} \ge md$.